Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Perf: construct & reuse clusters in output thread #3438

Closed
wants to merge 7 commits into from

Conversation

skyline75489
Copy link
Collaborator

Summary of the Pull Request

By constructing clusters for rendering early in the output thread, the rendering thread will be relieved, making overall performance better when under heavy load.

References

#3075

PR Checklist

  • Closes #xxx
  • CLA signed. If not, go over here and sign the CLA
  • Tests added/passed
  • Requires documentation to be updated
  • I've discussed this with core contributors already. If not checked, I'm ready to accept this work might be rejected in favor of a different grand plan. Issue number where discussion took place: #xxx

Detailed Description of the Pull Request / Additional comments

This is an attempt to tackle the TODO listed here:

// TODO: MSFT: 20961091 - This is a perf issue. Instead of rebuilding this and allocing memory to hold the reinterpretation,

I may be doing a lots of thing wrong but the performance boost is so good that I gotta share it with you guys. With this PR, catting my /etc/services takes only about 1.5 second, comparing to current release which takes about 5 seconds to do the same thing.

Validation Steps Performed

@skyline75489
Copy link
Collaborator Author

skyline75489 commented Nov 5, 2019

And another reason is that I can't get one thing working. The rendering seems fine. But when I started typing something, the last line gets printed twice:

图片

I'm extremely confused about this. How does rendering somehow affect what's got printed by conpty?

I could really use some advice here. @miniksa @DHowett-MSFT

@skyline75489
Copy link
Collaborator Author

To show how good this is and get everybody's attention, I'm gonna attach some evidence.

Using code on master at 94213ad

Duration: 5.1s
CPU Usage:

图片

Using this PR:

Duration: 1.2s
CPU Usage:

图片

Both result are with Release build and on the same machine.

private:
// This is the UTF-16 string of characters that form a particular drawing cluster
const std::wstring_view _text;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love the const here but I want to reuse existing _text and update the columns. So I had to make it this way.


// Ask the helper to paint through this specific line.
_PaintBufferOutputHelper(pEngine, it, screenLine.Origin());
const auto& rowData = buffer.GetRowByOffset(row);
Copy link
Collaborator Author

@skyline75489 skyline75489 Nov 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method _PaintBufferOutput is supposed to handle all kinds of dirty rect refreshing. I kind of made it only working for rows, which is basically all that's needed right now. Should make it better, though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably what ended up causing the duplicate conpty output - by just re-rendering the entire row at the time, that's causing the text to get duplicated on the Terminal side


size_t count = 0;
std::vector<Cluster> clusters;
clusters.reserve(runLength);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reserve here is intended to reduce vector reallocation and it works great. Without reserve, the performance is heavily penalized.

@@ -199,6 +217,19 @@ OutputCellIterator ROW::WriteCells(OutputCellIterator it, const size_t index, co
{
_charRow.DbcsAttrAt(currentIndex) = it->DbcsAttr();
_charRow.GlyphAt(currentIndex) = it->Chars();

if (it->DbcsAttr().IsDbcs())
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is kind tricky but I need it to handle dbcs since I'm in CJK environment myself. A cluster with column set to zero will be ignored for rendering.


size_t runStart = 0;
// This outer loop will continue until we reach the end of the text we are trying to draw.
while (runIndex < runCount)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where magic happens. We have already processed the text buffer and got the clusters we need. All that left is to walk through the row and render each run.

@zadjii-msft
Copy link
Member

I'm extremely confused about this. How does rendering somehow affect what's got printed by conpty?

There's a longer explanation I can give you, but the tl;dr is that conpty works by effectively rendering the contents of the buffer to a stream of characters. This stream of characters includes VT sequences, which let the terminal on the other side re-construct the state of the console buffer.

I'd take a cursory look at VtEngine, which is the "render engine" that powers conpty. Unlike the GDI, DX engines which draw content to the screen, the VT engine (and it's subclasses) renders to a stream of characters. And while the graphical renderers don't really care if they repeat a particular operation, the VT engine is a little more carefully optimized to emit the minimal amount of text possible to recreate the state of the console, so adding steps to the renderer might negatively impact that. I'll take a closer look.

@miniksa
Copy link
Member

miniksa commented Nov 5, 2019

I like the idea and appreciate focus on this area as I know it's definitely a performance sink.

However, for this particular TODO, I was hoping we could get away without having a Clusters vector at all.

It looks like you saved us by going from generating-it-every-time to generating-on-change and caching it.

I was hoping that we could implement a solution that perhaps provided an iterator (like the TextBufferCellIterator) as the first parameter on PaintBufferLine (instead of a span) and the internal loop would iterate until the iterator was false instead of from begin to end of the span. Then we wouldn't have to "store" an array of clusters at all. We could perhaps wrap a TextBufferCellIterator into a RenderClusterIterator, tell the RenderClusterIterator that it needs to return operator bool() as false when the underlying TextBufferCellIterator::TextAttr changes on increment, and provides accessors only for the character string and column string into the underlying cell iterator (which is a view directly into the text buffer).

That should mean that we don't have to pre-pass anything and we don't have to allocate any additional memory (besides a little bit for the iterator overhead and logic).

Then we'd in theory eliminate the CPU time from the vector overhead while not blowing up our memory storage. Probably not exactly as good on CPU time as the extra memory, but a better balance.

@skyline75489
Copy link
Collaborator Author

@zadjii-msft Thanks for the explanation.

@miniksa I think I know what you mean. I'll give it a try and left this PR open here.

@skyline75489
Copy link
Collaborator Author

skyline75489 commented Nov 6, 2019

Another observation that surprised me. The vector thing is affecting my Ryzen 2400G desktop PC not my i7-8750H gaming laptop, which means this PR doesn't really make a difference(less than 1 second saved) on my laptop. Seems Win-tel thing is still there somehow.

Update: never mind. I guess it's about the VtEngine @zadjii-msft mentioned. On my laptap I'm using plain cat and on my PC I'm sshing into a virtual machine and use that cat.

@skyline75489 skyline75489 mentioned this pull request Nov 6, 2019
5 tasks
@skyline75489
Copy link
Collaborator Author

Closed in favor of #3458

@skyline75489 skyline75489 deleted the fix/reuseClusters branch February 9, 2021 06:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants