What do client diagnostics in Cosmos DB mean?

This topic keeps coming up with my customers, so the purpose of this blog post is to keep all the useful information in one place. Think of it as a consolidation of different pieces of information. If you’re stuck on what “pipelined”, “created”, or “transit time” means, you’re in the right spot.

We are not talking about Diagnostic Settings/Log Analytics in this post; I’d describe those as server-side rather than client-side diagnostics. They are useful in different ways: client diagnostics help us understand the path that a request takes from the application to Cosmos DB and back, along with any latency on that path.

Note: before we get started, check whether you’re using Direct or Gateway mode. Using Direct mode in the .NET or Java SDKs will usually result in faster requests.

How do we get a Cosmos DB Client Diagnostic?

In order to understand a diagnostic log, you first need to capture it. To do so, you simply read the diagnostics object from the response of any request to Cosmos DB. All responses, successful and unsuccessful, will generate a diagnostic for your application.

Client Diagnostics for .NET SDK for Cosmos DB:

Client Diagnostics for Java SDK for Cosmos DB

Now that I have diagnostics in my application, let’s interpret them

At this point you should have a JSON object that contains words like “ChannelAcquisitionStarted” and “LatencyInMS.” We can simply follow the descriptions here to understand each of the steps in the connection; however, a couple of notes before we begin. LatencyInMS represents the duration of the whole request, while BELatency or BackendLatency is the duration on the server. It’s normal for those two values to be very different.

From the article linked above, pasted here for convenience:

"ChannelAcquisitionStarted: The time to get or create a new connection. Connections can be created for numerous reasons such as: The previous connection was closed due to inactivity using CosmosClientOptions.IdleTcpConnectionTimeout, the volume of concurrent requests exceeds the CosmosClientOptions.MaxRequestsPerTcpConnection, the connection was closed due to a network error, or the application is not following the Singleton pattern and new instances are constantly created. Once a connection is established, it is reused for subsequent requests, so this should not impact P99 latency unless the previously mentioned issues are happening.

Pipelined: The time spent writing the request into the TCP socket. Request can only be written on a TCP socket one at a time, a large value indicates a bottleneck on the TCP socket which is commonly associated with locked threads by the application code or large requests size.

Transit time: The time spent on the network after the request was written on the TCP socket. Compare this number to the BELatencyInMs. If BELatencyInMs is small, then the time was spent on the network, and not on the Azure Cosmos DB service. If the request failed with a timeout, it indicates how long the client waited without response, and the source is network latency.

Received: The time between the response was received and processed by the SDK. A large value is normally caused by a thread starvation or locked threads."

Based on those descriptions, here are some simple next steps if you have an issue:

  • If most of the time is spent in Transit Time: It’s either slow execution on the server or slow network traffic. Check BE/BackendLatency to see if it’s slow server-side. Also check request and response size to see if either one is abnormally large.
  • If most of the time is before Transit: It could be a problem acquiring the connection on the network or it could be a very busy application server. Check the memory and CPU statistics to see if your client has enough resources.

What can I do if BE/BackendLatency is high?

There’s a lot of room for tuning in Cosmos DB. My troubleshooting changes dramatically depending on the type of operation.

  • For slow queries, check the indexing policy. If it’s set to the default /*, the only options are to add composite indexes or consider reducing the size of the request and result set.
  • For slow create, update, and delete, consider reducing the number of indexes. The default /* indexing policy can mean that requests cost more RU than necessary so remove any unnecessary indexes after evaluating the indexes used by your application.
  • For slow reads (point reads), consider checking your document size. This one is the most difficult to troubleshoot but excessively large documents may not be optimal.

Other miscellaneous fields in the Client Diagnostic

There’s often a bunch of abbreviations at the bottom of the Client Diagnostic, like PT0 or PT1H. I always found these confusing. They’re ISO 8601 durations (PT1H is one hour) that describe your application’s connection configuration, and the details can be found here:

.NET Direct mode

Java Direct mode

I hope this was useful; I always end up searching through 4 or 5 pages to consolidate this information. Stay tuned!

Cosmos DB – Monitoring performance

Cosmos DB has become my latest focus and I hope to start sharing more about ways you can use it. Cosmos DB doesn’t provide the same granularity of query plans as you can get in SQL Server, but it actually does provide some super useful data in the Portal and in Log Analytics.

The first thing to know about Cosmos DB is that everything is measured in Request Units (RU). RUs are a measure of the CPU, memory, and disk IO needed to perform the operation and return the document. Please make special note of that second part: if you return a lot of documents, or exceptionally large documents, your RU charges will be high.

Queries, reads, inserts, updates, and deletes can all be measured by the RUs necessary to complete the action. This means you want to keep track of the Request Charge value of your queries. The other metric that’s important to track is Server-Side Latency; there should usually be a correlation between the RU charge and the Server-Side Latency of a request.

What’s next?

Now that you know the basics of a request charge, we need to get into Metrics and Diagnostic Settings. The default aggregation is usually Average or Sum, but I like to start by measuring the Max of Total Request Units in the Metrics tab. Then add splitting by Operation Type, and filter or split by Collection and Database. Combining these metrics will give you a great idea of the largest requests sent to Cosmos DB and what type they are.

Once we have that view, we’re looking for any exceptional activity. Things like a 10,000 RU query or a 5,000 RU upsert would certainly be worth investigating. Once you have an idea of where to look by using these Metrics, you will really need Diagnostic Settings. There’s a charge for collecting Diagnostic Settings, and they have to be enabled before the issue occurs so that you have data on it, but once you have that data, you’re in a great position to start analyzing any issues.

If you’re using the NoSQL API, start by using the example queries on this page. Just remember that when you include CDBQueryRuntimeStatistics, you will be limited to just queries and not inserts/writes/upserts/etc. To get that data, use CDBDataPlaneRequests on its own.

For further reading:
https://devblogs.microsoft.com/cosmosdb/cost-optimized-metrics-through-log-analytics/

Hope this was useful! In a future post, I’d like to dive deeper into how you can performance tune a high RU request. Stay tuned!

Presentation 5: Presenting Performance Tuning with Memory Grants

So it’s about a week after SQL Saturday Chicago 2019. The event was amazing: tons of great people and a great venue. Here are the details of what happened between Presentation 4 and SQL Saturday Chicago.

Final steps to prepare

Between the last post, presentation 4, and that day, I refined a lot of the presentation. Then I presented it to my team. They were very supportive, which was comforting. I also learned what parts I would need to practice.

While practicing on Friday evening, I found that my system would freeze when I ran my resource semaphore demo. Uh oh. There wasn’t enough time for me to completely debug the demo, although I could see that when it happened, my computer’s hard drive was at 100%.

Rushed for time

Resource semaphore waits are really important for my memory grant presentation. They’re the big problem that’s caused by excessive memory grants, and I wanted to at least show what it looked like. However, with only a few hours remaining, I had to find an alternate solution.
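If you want to check for these waits on your own system, here’s a minimal sketch (not the demo from my session) that looks for sessions currently stuck waiting on a memory grant:

--Sessions currently waiting on a query memory grant (RESOURCE_SEMAPHORE)
SELECT session_id, wait_type, wait_duration_ms
FROM sys.dm_os_waiting_tasks
WHERE wait_type = N'RESOURCE_SEMAPHORE';

--Requested vs. granted memory for current requests; grant_time stays NULL while a query waits
SELECT session_id, requested_memory_kb, granted_memory_kb, grant_time
FROM sys.dm_exec_query_memory_grants;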

I decided to capture a screenshot during the issues, as a way to prove that the issue was occurring. Even if my system wouldn’t respond, I could show the screenshot.

Here’s that screenshot:

SQL Saturday Chicago and my presentation

Then, on Saturday 3/23, I presented in front of a full room. I was really happy to have so many attendees, and the room had a lot of questions.

And, as expected I needed to use the screenshot. Not good! But it’s better than nothing, and I could show in my later demos that the queries would run with the improvements I suggested.

Overall thoughts

The presentation itself was a success, despite the demo issue. I have time now to do some troubleshooting on my laptop, and I have the screenshots for the next time I present.



One week blogging hiatus

I’m taking a brief break from daily blogging, even though it has been a ton of fun.

No excuses

I hope my blog has been enjoyable to read. I’ve learned a lot of things so far; it really has been a great few months of blogging.

Daily work stress and the number of hours I’ve put in over the last week and this weekend have left me exhausted. I love what I do, but the extra commitment of writing daily blog posts over a weekend or after work is currently too much.

Thank you to everyone who reads my blog, and thanks especially to the people who have left comments, shared my posts, and complimented my posts to me. It means so much to me to get feedback.

Where I’m going from here

There are no other planned posts this week while I recover from a very long week. That said, I want to continue two blog series I’ve been working on, the memory grants and the presentation I’m writing. There may be posts on those two subjects this week or next week!

I feel like I’ve only scratched the surface of query memory grants and I love the idea of passing along what I’ve been learning. I also really like sharing my presentation, even when it’s very early and the information isn’t clear.

Anyway, thanks for staying tuned! I’ll be re-focusing next weekend and coming back to blogging after a short break.

Favorite community scripts and their parameters

Here are some of my favorite community scripts and how I like to run them. If you haven’t tried the optional parameters on your favorite scripts, there are definitely some cool features to try.

sp_WhoIsActive

Written by Adam Machanic and hosted at whoisactive.com

This script is great for all kinds of performance troubleshooting. Here’s the parameter I add on: @get_plans = 1

This will slow the procedure down a bit, but it can be super useful in finding the execution plans of currently running queries. Here’s a quick before and after.

EXEC sp_WhoIsActive;
EXEC sp_WhoIsActive @get_plans = 1;

Awesome! Now we have the query plan. It won’t contain actual row count or actual executions, although I’d stay tuned for SQL Server 2019 and see what happens for live query executions.

sp_BlitzCache

It’s no secret that I’m a big fan of the First Responder Kit from Brent Ozar Unlimited, but this procedure in particular is amazing.

When you run sp_BlitzCache with no parameters, you get the top 10 plans from cache along with some information on your plan cache’s health. Add one of my two favorite parameters for more specific searching.

EXEC sp_BlitzCache @MinimumExecutionCount = 100;

The minimum execution count parameter will only analyze queries that have executed at least that many times. That’s great for troubleshooting systems with high batch requests per second, because you can exclude expensive queries that only run once or twice a day. Pick the right number for your system.

EXEC sp_BlitzCache @OnlyQueryHashes = '';

I’m a big fan of using the query hash to identify and track problematic queries. Check out my post for why I like it so much. If you have one or more query hashes that you want to track, you can list them here in a comma separated list.
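If you need a query hash to start with, sys.dm_exec_query_stats exposes one for every cached statement. Here’s a minimal sketch; the hash passed to sp_BlitzCache below is just a placeholder, not a real value from my system.

--Find candidate query hashes, sorted by total CPU
SELECT TOP 10
    stat.query_hash,
    stat.execution_count,
    stat.total_worker_time,
    stext.text
FROM sys.dm_exec_query_stats AS stat
CROSS APPLY sys.dm_exec_sql_text(stat.sql_handle) AS stext
ORDER BY stat.total_worker_time DESC;

--Then track that hash over time (placeholder hash value below)
EXEC sp_BlitzCache @OnlyQueryHashes = '0x1AB614B461F4D769';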

Let me know if you have other favorite procedures or parameters! Stay tuned.

CTRL + F in SSMS execution plans

I just discovered this the other day and I had to share it.

First, we need a query in Management Studio

USE StackOverflow2010
GO

SELECT *
FROM Posts as p
JOIN Users as u on u.Id = p.LastEditorUserId
WHERE u.Id = 1

Okay, now I’ll get the execution plan. It looks like this:

Good? Maybe.

Now, click on the execution plan and use the keyboard shortcut, CTRL+F

This menu should come up.

If we open the ActualCPUms window, we get even more options.

And now for a demo using it:

Let’s say we want to find all the operators that reference the Posts table. Simply search for Posts with the Contains option, and use the arrows to go through the matching operators. See the video demo below.

I recommend opening it in full screen:

SSMS tip: Splitting the same query window

This is a trick I use in SQL Server Management Studio at least once a week.

Today’s scripts will be screenshots of Management Studio to show the trick.

Here’s our super long script:
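For anyone following along without the screenshots, a minimal sketch of that kind of script might look like this (assuming the StackOverflow2010 Users and PostTypes tables, with many more statements in between):

USE StackOverflow2010
GO

SELECT TOP 100 Id, DisplayName, Reputation
FROM Users;

--imagine a few hundred more lines of queries here

SELECT Id, Type
FROM PostTypes;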

That’s ok, but now we want to see the SELECT from Users next to the SELECT from PostTypes. We can do that really easily with this trick.

Click and drag right here

After clicking and dragging, then scrolling a bit, I can align the two SELECTs like this:

To me, that’s a lot easier to read. You essentially get two copies of the same window.

They can show the same area at the same time

Okay, hope this was useful. Have a good weekend!

Stay tuned!

A short guide to sp_executesql

Last week I talked about single use plans. One way to increase execution plan re-use is to parameterize queries by wrapping them in sp_executesql.

I’ve started using this stored procedure enough that I wanted to write a short guide on converting a query into sp_executesql.

Our query before sp_executesql

SELECT CreationDate
FROM Users
WHERE DisplayName = N'Community'
AND Location = N'on the server farm'

So there are two values in the WHERE clause that could change between queries, if we want to look for a different DisplayName or Location. Let’s parameterize those with sp_executesql.

EXEC sp_executesql N'
SELECT CreationDate
FROM Users
WHERE DisplayName = @DisplayNameParam
AND Location = @LocationParam'

That’s a good start, but now we need to tell the server what our parameters are. Add a comma, and add those after the first string like this:

EXEC sp_executesql N'
SELECT CreationDate
FROM Users
WHERE DisplayName = @DisplayNameParam
AND Location = @LocationParam',
N'@DisplayNameParam NVARCHAR(40), @LocationParam NVARCHAR(100)'

Completing the example of sp_executesql

And finally, we need another two commas. This is the part where we add the values for the parameters, in the same order that we declared them. That means DisplayNameParam first, then LocationParam second.

EXEC sp_executesql N'
SELECT CreationDate
FROM Users
WHERE DisplayName = @DisplayNameParam
AND Location = @LocationParam',
N'@DisplayNameParam NVARCHAR(40), @LocationParam NVARCHAR(100)',
N'Community', N'on the server farm'

Perfect! Here’s the execution plan and the parameter list from the SELECT operator.

Changing the values to prove the execution plan will be re-used

Let’s look for a user that doesn’t exist. Here are some parameters that won’t match any rows in StackOverflow2010.

EXEC sp_executesql N'
SELECT CreationDate
FROM Users
WHERE DisplayName = @DisplayNameParam
AND Location = @LocationParam',
N'@DisplayNameParam NVARCHAR(40), @LocationParam NVARCHAR(100)',
N'Not a real user', N'No location'

Here’s the same execution plan and parameter list. The compiled values are Community and on the server farm, showing that the first plan was re-used.

That’s it for an introduction to sp_executesql. You can add more parameters, I just chose to use two for this demo.

Stay tuned!

Database context and compatibility level

To add to yesterday’s post about which cardinality estimator (CE) your query will use, there’s an additional complexity. This specifically applies to cross-database queries.

The database context in which you run the query will determine which CE is used. Let’s look at an example.

First, check on the master and StackOverflow2010 compatibility levels
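A quick way to check both is to query sys.databases:

SELECT name, compatibility_level
FROM sys.databases
WHERE name IN (N'master', N'StackOverflow2010');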

Querying a 3-part name: StackOverflow2010.dbo.Users

USE [StackOverflow2010]
GO

SELECT TOP 100 * 
FROM StackOverflow2010.dbo.Users AS u
WHERE u.Reputation = 1001

This is our baseline query from yesterday. As expected, when we look in the properties of the SELECT operator in the execution plan, we’ll see CE version 70.

Changing database context to master

In this example, I’ll use the master database but it could be any other database.

USE [master]
GO

SELECT TOP 100 * 
FROM StackOverflow2010.dbo.Users AS u
WHERE u.Reputation = 1001

So, now that we’re running in master, we used CE version 140. In this particular execution plan, it didn’t make a difference which CE was used; I apologize for not having a good demo of that yet.

I hope this helps when looking at upgrading compatibility levels. The CE version will change based on what database context was used.
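As a reminder, the compatibility level itself is changed with a single statement (test carefully before changing it in production):

ALTER DATABASE [StackOverflow2010] SET COMPATIBILITY_LEVEL = 140;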

Stay tuned!

Optimize for ad hoc VS sp_executesql?

Happy Friday! We made it. Here’s something I came across while testing optimize for ad hoc for this week’s blog posts.
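In case you want to follow along, optimize for ad hoc workloads is a server-level setting that can be turned on with sp_configure (it affects the whole instance, so use a test server):

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'optimize for ad hoc workloads', 1;
RECONFIGURE;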

First, the (better) way to use sp_executesql

The rest of this post will be using ad hoc queries with optimize for ad hoc ON at the server level. But first, take a look at using a parameter inside sp_executesql and what it shows inside the plan cache.

I want to be clear that the pattern shown later in this post is not the best way to use sp_executesql. So first, here’s my query and how I could properly wrap it in sp_executesql.

--Here's the ad hoc version
SELECT Score FROM Posts AS p
JOIN Users as u ON u.Id = p.OwnerUserId  
WHERE p.Id = 13
GO
--And here's it wrapped it in sp_executesql
--with a variable, @PostId instead of the literal, 13
EXEC sp_executesql N'SELECT Score FROM Posts AS p  
JOIN Users as u ON u.Id = p.OwnerUserId
WHERE p.Id = @PostId', N'@PostId int', 13
GO

And here’s the plan cache after running those two queries, with optimize for ad hoc on.

SELECT TOP 10 
cache.size_in_bytes,
cache.objtype,
stat.execution_count,
stext.text,
splan.query_plan
FROM sys.dm_exec_query_stats as stat
CROSS APPLY sys.dm_exec_sql_text(stat.sql_handle) as stext
CROSS APPLY sys.dm_exec_query_plan(stat.plan_handle) as splan
JOIN sys.dm_exec_cached_plans as cache on cache.plan_handle = stat.plan_handle
ORDER BY stat.last_execution_time DESC;

Okay, so this is as expected. The ad hoc query didn’t store a full execution plan in cache, only a small stub, which is why its size is so much smaller. But what if we didn’t replace 13 with the variable @PostId?
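Before we get to that: if you want to see the stub directly, sys.dm_exec_cached_plans reports it with a cacheobjtype of Compiled Plan Stub:

SELECT objtype, cacheobjtype, usecounts, size_in_bytes
FROM sys.dm_exec_cached_plans
WHERE objtype = 'Adhoc';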

Leaving things as they are

Sometimes, we don’t have time to change the code. Sometimes the code is still using literals inside sp_executesql, essentially running ad hoc statements. Let’s take a look at that scenario.

--ad hoc version
SELECT Score FROM Posts AS p
JOIN Users as u ON u.Id = p.OwnerUserId  
WHERE p.Id = 13
GO


EXEC sp_executesql N'SELECT Score FROM Posts AS p  
JOIN Users as u ON u.Id = p.OwnerUserId
WHERE p.Id = 13'
GO

Here’s the results of our plan cache query.

Changing the order of statements – sp_executesql first

Now, you might wonder what happens if we change the order of the statements. Since I ran the sp_executesql as the second statement, maybe it cached that plan. Let’s run the sp_executesql statement first.


EXEC sp_executesql N'SELECT Score FROM Posts AS p  
JOIN Users as u ON u.Id = p.OwnerUserId
WHERE p.Id = 13'
GO

--ad hoc version
SELECT Score FROM Posts AS p
JOIN Users as u ON u.Id = p.OwnerUserId  
WHERE p.Id = 13
GO

So, why does sp_executesql cause ad hoc plans to be stored as full plans in cache?

I don’t know. I came across this while blogging and thought it was worth its own post. If you know why or know more, feel free to send me a message or leave a comment!

Stay tuned!

