  #1  
Old 30th March 2016, 05:25 AM
Themaister Themaister is offline
Junior Member
 
Join Date: Mar 2016
Posts: 8
Default Compute shader based LLE RDP ideas

Wasn't sure of the best place to post this, but this is probably fine. This is my first post here.
Hopefully someone can comment on whether this path has been tried before.

I've been looking at the latest efforts in N64 LLE over the last few days, and the graphics programmer in me got interested when I started reading up on the RDP in more detail.

NOTE: I'm not an N64 emudev; I'm looking at this from the outside, so bear with me if some of the details are a bit off.

Since I'm a (GP)GPU programmer, I want to see if the LLE RDP can somehow be accelerated on the GPU while staying bit-exact.
The naive "render triangles" method is inadequate, I think; trying to do LLE with an HLE mindset is fundamentally flawed.

Looking at Angrylion's RDP plugin, the N64 clearly has an extremely specific rasterization and interpolation algorithm
which I believe is not possible to recreate exactly with a naive "draw triangle" in OpenGL or any other API (I haven't seen any evidence to the contrary).
Apparently all the OpenGL LLE implementations atm are broken in some way, so clearly it’s nontrivial.
We need a custom rasterizer and this is where things will start to get interesting ...

The two most important things for faithful-looking results are an exact rasterizer and exact depth interpolation,
compare and precision, and these are unfortunately the things OpenGL guarantees nothing about, except that they are consistent on a particular GPU.
OpenGL specifies its rasterizer based on an idealized barycentric scheme (cross product), while the
N64 RDP is a scanline rasterizer with very particular fixed-point math to determine the start/end of each
scanline, plus what seems to be a highly custom way of dealing with AA using a very exact coverage mask computation.
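(For comparison, a completely generic fixed-point edge walk looks something like the sketch below. This is purely illustrative; it is NOT the RDP's actual stepping, rounding or coverage rules, just the general shape of a scanline rasterizer for readers who only know the barycentric model.)

Code:
/* Purely illustrative: a generic edge-walking scanline rasterizer in 16.16
 * fixed point. Not the RDP's actual rules. */
#include <stdint.h>
#include <stdio.h>

typedef int32_t fx16; /* 16.16 fixed point */

/* Walk the left/right edges of a trapezoid from y_top to y_bottom.
 * xl/xr are edge X at the top scanline, dxl/dxr are per-scanline steps (dx/dy). */
static void edge_walk(int y_top, int y_bottom, fx16 xl, fx16 dxl, fx16 xr, fx16 dxr)
{
    for (int y = y_top; y < y_bottom; y++)
    {
        int start = xl >> 16; /* truncate to a pixel column; real hardware has its own rounding */
        int end   = xr >> 16;
        for (int x = start; x < end; x++)
            printf("pixel (%d, %d)\n", x, y);
        xl += dxl;
        xr += dxr;
    }
}

int main(void)
{
    /* Left edge starts at X=4.0 stepping +0.5/line, right edge at X=12.0 stepping -0.25/line. */
    edge_walk(0, 8, 4 << 16, 1 << 15, 12 << 16, -(1 << 14));
    return 0;
}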

Another point is the combiner. Emulating this exactly clearly requires extremely specific fixed-point math, and the fixed-function blending unit in OpenGL can't do this, so we would need programmable blending and depth testing.
This isn't impossible; on tile-based mobile GPUs it's actually quite doable, but obviously far from as efficient as fixed-function blending.
Desktop GPUs can do it as well, but it means a full ROP sync, which really isn't something you want to do on an immediate-mode GPU.

To do a custom rasterizer and combiner, I think in theory it could work like this:

LLE triangle:
- Draw a quad which is the bounding box (computed earlier on CPU). Pass along all relevant fixed point parameters as uniforms.
- Do a custom rasterizer check per pixel, discard if failed and interpolate everything by hand.
- Read per-pixel bits and do combine/depth manually, discard fragment if depth test fails, etc.

I don't think this is the way to go, though. The stalls between pixels here are batshit insane and it won't scale well.
Also, the only API which would enable this kind of feature in a standard way is Vulkan with its multipass feature.

Another challenge with the N64 is its highly coupled framebuffer, so readbacks from the RDP are a must for correct emulation.
We need to be able to fire off RDP rendering work and quickly get results back.
The fragment stage comes very late in the graphics pipeline, so readbacks become very expensive;
reading back from render targets often triggers a decompression pass on GPUs as well, which adds further delay.
This becomes painful if we're running any other shaders, like perhaps VI filter emulation or post-process shaders in the frontend, because that work from previous frames has to complete as well.

This leaves us with compute shader rasterization!
This might sound weird, but with modern compute-enabled GPUs this isn't as stupid as it sounds.
I think PCSX2 has done something similar, and there are some well-performing CUDA-based software rasterizers out there.

We can implement our own tile-based scheme where combining and depth testing happen in GPU registers rather than hitting main memory.
We can buffer up an entire frame's worth of work, bin it to tiles on the CPU and fire off a compute dispatch that churns through all the tiles.
The RSP has already done all the hard work for us in vertex processing so I think the structure could look something like this on GPU (rough sketch):

Code:
buffer TriangleData { RDPCommandData triangles[]; };   // Filled on CPU
struct PerTileData { uint num_triangles; uint triangle_indices_hitting_this_tile[MAX_TRIANGLES_PER_TILE]; };
buffer Tiles { PerTileData tile_data[]; };              // Filled on CPU (binning)
buffer Framebuffer { uint pixels[]; }; // Streamed from RDRAM on rendering start. Flushed back out to RDRAM when needed (FB readback/modify on CPU).

layout(local_size_x = 8, local_size_y = 8) in; // 8x8 tiles, arbitrary.
// stride / num_tiles_x would be uniforms; RDPCommandData, InterpolateData and the
// helper functions are placeholders for the actual RDP logic.

void main()
{
	uvec2 pixel = gl_GlobalInvocationID.xy;
	uint tile = gl_WorkGroupID.y * num_tiles_x + gl_WorkGroupID.x; // flattened tile index
	uint frame_buffer = pixels[pixel.y * stride + pixel.x];
	uint num_triangles = tile_data[tile].num_triangles;
	uint coverage; // coverage mask written by the rasterizer test (out parameter)

	for (uint i = 0; i < num_triangles; i++)
	{
		RDPCommandData tri = triangles[tile_data[tile].triangle_indices_hitting_this_tile[i]];
		if (rasterizer_test_passes(pixel, tri, coverage))
		{
			InterpolateData interp = interpolate_color_and_sample_textures(pixel, tri);
			if (depth_testing(frame_buffer, interp))
			{
				frame_buffer = combiner_and_blend(frame_buffer, interp, coverage);
			}
		}
	}
	pixels[pixel.y * stride + pixel.x] = frame_buffer;
}
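To illustrate the CPU side of the binning, something like this could work. It just mirrors the GPU buffers from the sketch above; the bounding-box-only RDPCommandData and the fixed MAX_TRIANGLES_PER_TILE limit are simplifications of mine, not anything taken from Angrylion or an existing plugin.

Code:
/* Hypothetical CPU-side binning: assign each RDP triangle to every 8x8 tile
 * its bounding box touches. tile_data must be zero-initialized by the caller. */
#include <stdint.h>

#define TILE_SIZE 8
#define MAX_TRIANGLES_PER_TILE 1024

typedef struct { int min_x, min_y, max_x, max_y; /* pixel-space bounding box */ } RDPCommandData;

typedef struct {
    uint32_t num_triangles;
    uint32_t triangle_indices_hitting_this_tile[MAX_TRIANGLES_PER_TILE];
} PerTileData;

static void bin_triangles(const RDPCommandData *tris, uint32_t num_tris,
                          PerTileData *tile_data, int tiles_x, int tiles_y)
{
    for (uint32_t i = 0; i < num_tris; i++)
    {
        /* Clamp the bounding box to the framebuffer and convert to tile coordinates. */
        int tx0 = tris[i].min_x / TILE_SIZE, tx1 = tris[i].max_x / TILE_SIZE;
        int ty0 = tris[i].min_y / TILE_SIZE, ty1 = tris[i].max_y / TILE_SIZE;
        if (tx0 < 0) tx0 = 0;
        if (ty0 < 0) ty0 = 0;
        if (tx1 >= tiles_x) tx1 = tiles_x - 1;
        if (ty1 >= tiles_y) ty1 = tiles_y - 1;

        for (int ty = ty0; ty <= ty1; ty++)
            for (int tx = tx0; tx <= tx1; tx++)
            {
                PerTileData *t = &tile_data[ty * tiles_x + tx];
                if (t->num_triangles < MAX_TRIANGLES_PER_TILE)
                    t->triangle_indices_hitting_this_tile[t->num_triangles++] = i;
            }
    }
}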
There are some problems with this right off the bat though.
If we batch up arbitrary triangles, the triangles can have completely different combiners and render states.
This becomes ugly really quickly since we need an ubershader to deal with all possible cases.
This hurts occupancy and adds lots of useless branches (but at least the branches are coherent!).

Other problems include textures changing a lot mid-frame (especially with that anaemic texture cache),
but this is probably not as bad, since we can likely use 2D texture arrays with mipmaps
to hold a large number of independent textures of all possible sizes in the same texture object.

It is certainly doable to go through multiple tile passes per frame if we want to replace
the compute shader mid-frame due to changing combiners and rendering states.
Finding the right balance between the number of passes and longer, more complex shaders is a very tinkering-heavy implementation detail.
At least with Vulkan, we can hammer on with lots of dispatches and custom linear allocators can
avoid any driver churn with allocating lots of buffers here and there … :3

A compute approach like this could be a significant improvement for the N64 LLE effort, I think.
Some GPUs can do compute and fragment work in parallel, which lets us hide a lot of the processing.
Some GPUs even have dedicated compute queues (exposed in Vulkan, for example), which improves this even more:
compute doesn't have to wait for fragment work to complete, so readbacks can be done cleanly and quickly.

Sure, this approach won't get close to HLE rendering speed, but any half-decent GPU should still churn through this like butter;
this is 320x240 territory, not exactly 1080p :P

I’m playing around with a pure rasterizer using this approach with RDP triangle semantics to get a feel for how it works.
  #2  
Old 30th March 2016, 04:52 PM
Mr.64 Mr.64 is offline
Member
 
Join Date: Dec 2011
Posts: 59
Default

Quote:
Originally Posted by Themaister View Post
Apparently all the OpenGL LLE implementations atm are broken in some way, so clearly it's nontrivial.
Are they? I think you are underestimating OpenGL. OpenGL LLE RDP emulation has barely been worked on. The only one who researched it was Ziggy back in 2007. His plugin gives good results; you should check it out. Here is my fork on GitHub: https://github.com/purplemarshmallow/z64. I think OpenGL LLE has a lot of potential.

Compute shader rasterization sounds interesting, but I fear you might get the disadvantages of both: the slow speed and resolution overhead of a software renderer, and the inaccurate split memory architecture of a hardware renderer.
  #3  
Old 30th March 2016, 05:47 PM
RPGMaster's Avatar
RPGMaster RPGMaster is offline
Alpha Tester
Project Supporter
Super Moderator
 
Join Date: Dec 2013
Posts: 2,029
Talking

Hello Themaister. I'm glad to see someone attempt this sort of project.

I'm pretty familiar with Angrylion's code. I've been studying, profiling, and optimizing it. Unfortunately I'm not much of an API programmer. I mostly focus on CPU. If I knew more about GPU programming, I think I'd be able to help out a lot. I do want to learn a graphics API someday. I was thinking I may start, once I feel I've done a good enough job with a software RDP.

I may not be able to help with GPU programming, but if you need any help understanding any part of Angrylion's code, I should be able to help with that. I've studied most of the code, but mainly focus on the bottleneck portions. I can also help with any optimizations that can apply to both software and hardware renderers.

I wish you the best of luck.
  #4  
Old 30th March 2016, 06:53 PM
Mr.64 Mr.64 is offline
Member
 
Join Date: Dec 2011
Posts: 59
Default

Yes, good luck with this project. I hope what I wrote did not come across too harshly. I do think it's a good idea; you can get around some limitations in OpenGL and be more efficient. I hope I did not kill your motivation. OpenGL plugins do suffer from problems with blending and depth compare, and with compute shaders it should be possible to do a good job.
  #5  
Old 30th March 2016, 07:20 PM
retroben's Avatar
retroben retroben is offline
Alpha Tester
Project Supporter
Senior Member
 
Join Date: Jul 2013
Posts: 685
Default

This may greatly help GLideN64 with its software portion when running certain parts of Conker and a few other games.
For example, the scene for Bugga's Knuts when you defeat him uses software mode so he is censored, which is painfully slow currently.

It might only be possible via GLideN64 if Vulkan is considered; the one argument for that is that Vulkan can use OpenGL 4.5-level features on capable PCs plus Android devices like the Shield TV.
  #6  
Old 31st March 2016, 10:04 PM
Themaister Themaister is offline
Junior Member
 
Join Date: Mar 2016
Posts: 8
Default

First day of development, it seems like the rasterization part of things is working.

Used PeterLemon's FillTriangle tests (silly board doesn't let me post URLs yet <_<) as a "hello world", and hacked Angrylion to dump out a file with the RDP command buffer for playback in the prototype. (Yay for small test programs!) In the end, it needed subpixel accuracy in Y to get correct results vs. Angrylion.

RPGMaster, maybe you can explain:

It seems like the rasterizer in Angrylion creates spans which are inclusive, e.g. with edge coordinates at X = 10.75 and X = 20.25 it will create pixels at [10, 20] inclusive. This feels counterintuitive.

It will set up LX as the minimum of the LXs of the 4 subpixels in Y, and RX is the maximum, then it will loop later over (i = lx; i <= rx; i++).
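In code form, my reading of it is roughly this (a paraphrase of the behaviour described above, not the actual Angrylion source):

Code:
/* Paraphrase of the span setup as I read it -- not Angrylion's actual code.
 * lx[s]/rx[s] are the left/right edge X of the 4 Y-subpixels, already reduced
 * to integer pixel coordinates; shade_pixel() stands in for the span loop body. */
static void draw_span(const int lx[4], const int rx[4], void (*shade_pixel)(int x))
{
    int span_lx = lx[0], span_rx = rx[0];
    for (int s = 1; s < 4; s++)
    {
        if (lx[s] < span_lx) span_lx = lx[s];
        if (rx[s] > span_rx) span_rx = rx[s];
    }
    /* Inclusive on both ends, hence the question about shared edges. */
    for (int x = span_lx; x <= span_rx; x++)
        shade_pixel(x);
}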

From the looks of it, this is in effect conservative rasterization. Do you know if this is really the case? In effect it would mean that there are overlapping pixels when drawing two triangles which share an edge, e.g. in triangle strips. Tie-break rules (in GL/D3D) solve this, but I don't see anything like that in the RDP.

Is it intended that you have to enable AA to make this less conservative?
It seems like the coverage calculation could take this into account at least.
  #7  
Old 2nd April 2016, 09:49 PM
RPGMaster's Avatar
RPGMaster RPGMaster is offline
Alpha Tester
Project Supporter
Super Moderator
 
Join Date: Dec 2013
Posts: 2,029
Default

Quote:
Originally Posted by Themaister View Post
RPGMaster, maybe you can explain:

From the looks of it, this is in effect conservative rasterization. Do you know if this is really the case? In effect it would mean that there are overlapping pixels when drawing two triangles which share an edge, e.g. in triangle strips. Tie-break rules (in GL/D3D) solve this, but I don't see anything like that in the RDP.

Is it intended that you have to enable AA to make this less conservative?
It seems like the coverage calculation could take this into account at least.
I'm not sure tbh. You would probably get a great answer from Angrylion, if you contact him.

I have read through most of Angrylion's code, but there are still some parts I am not completely familiar with. Lately, I've been focusing on the combiner, blender, and framebuffer code. I haven't focused on coverage yet.
  #8  
Old 3rd April 2016, 09:58 AM
Themaister Themaister is offline
Junior Member
 
Join Date: Mar 2016
Posts: 8
Default

Hm ... Well, got multisampled raster implemented. 8x AA is pretty wild for such an old GPU. It does indeed seem like the raster rules for AA mode are consistent w.r.t. shared edges. The coverage computation in Angrylion took some time to get, but it's kinda clever. In compute shaders however, I just translated it to vectorized comparisons.

I think I've got a pretty decent compute rasterizer now, so it's time to look at attribute interpolation and textures.

Attributes seem fairly easy, but textures are quite interesting.
The weirdness with attributes is the magical 3/4th pixel offset when DxHDy ^ flag == 0. I'm not sure what the idea of that is. It is also the only place dATTRdy is used, otherwise dATTRde is used, which is not very intuitive, but hey ...

I want to build a batching system where I fit all the tiles for a frame into a 2D array, then each primitive can access "tile descriptors" which contain (X, Y, layer) and various bits to do manual wrapping/clamp efficiently. Sparse textures would be quite nice here, but that's kinda overkill ...
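A rough sketch of what such a tile descriptor might contain (all fields are just my guesses at what would be needed, nothing that exists yet):

Code:
/* Hypothetical "tile descriptor" for a batched 2D texture array scheme.
 * Each RDP tile uploaded during the frame gets one of these; primitives
 * index into the descriptor table instead of binding textures. */
#include <stdint.h>

typedef struct
{
    uint16_t array_layer;    /* which layer of the 2D texture array holds the texels */
    uint16_t offset_x;       /* top-left corner of this tile within the layer */
    uint16_t offset_y;
    uint16_t width, height;  /* tile size in texels, for manual wrap/clamp */
    uint8_t  clamp_s, clamp_t;
    uint8_t  mirror_s, mirror_t;
    uint8_t  mask_s, mask_t; /* wrapping masks, applied in the shader */
    uint8_t  format, size;   /* RDP texel format/size, in case conversion is deferred */
} TileDescriptor;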

Is there any code like this already implemented somewhere? I have only found non-batched implementations where every tile translates to glTexImage almost directly ...
  #9  
Old 17th April 2016, 07:47 PM
Themaister Themaister is offline
Junior Member
 
Join Date: Mar 2016
Posts: 8
Default

Some more progress. The batched 2D texture array worked out well, at least with smaller test cases. Started implementing the combiner and blender. The big muxes in the combiner do worry me a bit, though. The single-threaded CPU rendering tricks of rerouting some pointers in set_combine() won't fly on the GPU. I'm not sure how often real games swap out combiners, but it might be possible to avoid ubershaders, provided there's enough work between combiner changes, by compiling compute shaders that each only deal with a tiny subset of the possible render states.
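Roughly, the idea would be to hash the relevant render state and prepend #defines when compiling each variant, something like the sketch below. The key fields, the define names and the idea of #if-ing away unused mux paths in the shader body are placeholders of mine, not an existing implementation.

Code:
/* Sketch only: build a specialized compute shader source for one
 * combiner/render-state "variant" by prepending #defines. */
#include <stdint.h>
#include <stdio.h>

typedef struct
{
    uint32_t combine_hi, combine_lo;         /* SET_COMBINE words */
    uint32_t other_modes_hi, other_modes_lo; /* relevant SET_OTHER_MODES bits */
} RenderStateKey;

/* The shader body would #if on these defines so unused combiner paths fold away. */
static void build_variant_source(const RenderStateKey *key, const char *shader_body,
                                 char *out, size_t out_size)
{
    snprintf(out, out_size,
             "#version 450\n"
             "#define COMBINE_HI 0x%08xu\n"
             "#define COMBINE_LO 0x%08xu\n"
             "#define OTHER_MODES_HI 0x%08xu\n"
             "#define OTHER_MODES_LO 0x%08xu\n"
             "%s",
             key->combine_hi, key->combine_lo,
             key->other_modes_hi, key->other_modes_lo,
             shader_body);
}

Each unique key would be looked up in a small cache so a variant is only compiled once per combination, and the command stream just records which variant each batch uses.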

Figured out the last puzzle with conservative raster.
Non-AA mode in Cycle1/2 modes does use the first subsample as a single-bit coverage when deciding whether to pass the blender stage (a strange place to decide on rasterization rules ...). So it seems only Fill mode is conservative, which is reasonable enough.

Added better RDP dumping support, so I hope to start tracing dumps from real games soon-ish.

EDIT: When I think about it, the worst muxing in the combiner can actually be resolved ahead of time when building the batch.
There are tons of "mux in this static data that depends on global state" options.

Last edited by Themaister; 17th April 2016 at 07:54 PM.
  #10  
Old 19th April 2016, 01:29 AM
RPGMaster's Avatar
RPGMaster RPGMaster is offline
Alpha Tester
Project Supporter
Super Moderator
 
Join Date: Dec 2013
Posts: 2,029
Default

Generally, from what I've seen, there aren't too many combinations being used per game. I don't know much about shaders, but I imagine it would be beneficial to compile compute shaders that only deal with a tiny subset of possible render states. Oftentimes in Combiner_1cycle for games I've profiled, the combiner just does
Code:
(((a * c) + 0x80) & 0x1ffff) >> 8;
rather than the full
Code:
((((a - b) * c) + (d << 8) + 0x80) & 0x1ffff) >> 8;
Sometimes even
Code:
(d & 0x1ff)
is common.

For the combiner, you can take advantage of the fact that some of the colors are 8bit, like env, prim, and shade. Unfortunately, texel0 can't be guaranteed to be 8bit, even when the format isn't YUV, although you could simply make the assumption that it is 8bit when it's not YUV.

One part I'm stuck on optimizing in Angrylion's code is texture_pipeline_cycle. That function, including the texel fetch functions, is the main bottleneck. Vectorizing it wasn't enough, although it certainly helped.