HEXTRAIN on MTXplus+

The problem

HEXTRAIN is a very demanding game (from a timing point of view). It runs on MEMU, MTX with REMEMOrizer, REMEMOTECH, and MTX with CFX. But when we ran it on MTXplus+ at Memofest 2016 this is what happened :-

HEXTRAIN for MTXplus+ was unveiled at Memofest 2017.

Baseline spec

All of the systems that worked either have a 9929 VDP, or an equivalent that requires no delays between accesses.

The best knowledge that we have about the 9918A/9929A VDP is that successive data input/outputs must be at least 2us apart if video output is disabled (blanked), in the vertical blank, or in text mode. In the active part of the display, in other modes, this must be at least 8us. For a 4MHz Z80, these correspond to 8T and 32T respectively. In practice the fastest a Z80 can output is once every 11T, so it's only outputs during the active part of the display that need delays. No restrictions are known for accesses to the VDP control port.
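The conversion from microseconds to T-states can be checked with a little arithmetic. This is a minimal sketch in Python, not part of HEXTRAIN; the function and constant names are made up for illustration :-

```python
def us_to_tstates(microseconds, clock_mhz):
    """Convert a time in microseconds to Z80 T-states at the given clock."""
    return microseconds * clock_mhz


# Fastest back-to-back Z80 output rate quoted in the text above.
FASTEST_OUT_T = 11

for gap_us in (2, 8):
    needed = us_to_tstates(gap_us, 4.0)
    if FASTEST_OUT_T >= needed:
        verdict = "no delay needed"
    else:
        verdict = "delay needed"
    print(f"{gap_us}us at 4MHz = {needed:.0f}T: {verdict}")
```

The 2us case is always satisfied by the 11T minimum, which is why only the 8us (active display) case needs explicit delays at 4MHz.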

We've also never needed to add delays :-

Finally, there is a common trick in which it is possible to only output the low byte of the VDP address register if the high byte is unchanged. Pothole Pete does this for example, and so does HEXTRAIN.

MTXplus+ spec

The MTXplus+ either has a Z80 or a Z180, at various clock speeds. Supported clock speeds include 4MHz, 8MHz, 16MHz and various other values.

The VDP is a Yamaha V9958, which is touted as "compatible" with the V9938 and thus 9918A/9929A.

The V9958 sometimes needs longer I/O delays than a 9918A/9929A :-

Another important observation is that if we only write the low byte of the VDP address, screen corruption results. In fact, this may have been the biggest contributor to the screen corruption seen at Memofest 2016.

The V9958 can be configured (set bit 1 of register R#9, the *NT bit) to work in 50Hz rather than 60Hz mode. Doing so creates a larger vertical blank time, which HEXTRAIN relies on.

The V9958 can also be configured (set bit 2 of register R#25, the WTE bit) to keep the Z80/Z180 waiting until its internal V9958-to-VRAM transfer is done. This is presumably to allow faster CPUs.
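As a rough illustration of the two register settings above, the sketch below builds the byte pairs that would be sent to the VDP control port. It assumes the usual V9938-style direct register write (data byte first, then 080H OR the register number) and, purely for the example, that all other bits in R#9 and R#25 are zero; real code would have to preserve whatever those bits need to be. The names are hypothetical :-

```python
NT_BIT = 1 << 1   # R#9 bit 1: 50Hz mode, as described above
WTE_BIT = 1 << 2  # R#25 bit 2: keep the CPU waiting during VRAM transfers


def register_write(reg, value):
    """Return the two bytes written to the control port to load a register.

    Assumes the V9938-style sequence: data byte, then 0x80 | register number.
    """
    return [value & 0xFF, 0x80 | (reg & 0x3F)]


# Assumption for illustration only: every other bit in both registers is 0.
r9 = 0x00 | NT_BIT
r25 = 0x00 | WTE_BIT

for reg, value in ((9, r9), (25, r25)):
    data, select = register_write(reg, value)
    print(f"R#{reg}: OUT {data:02X}H then OUT {select:02X}H")
```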

MTXplus+ has a VS2 GAL which has the following inputs :-

This GAL has a state machine that can add additional wait-states at certain speeds. The GAL produces a /WAIT signal that goes to the Z80/Z180. However, from experimentation we see that at 4MHz, not only does the GAL not add additional wait-states, it also does not propagate the /VDPWAIT signal to the Z80/Z180 /WAIT.

An anomaly/glitch was spotted when reading the VDP status register. A test was written to count the CPU cycles between video frames. It waited for bit 7 of the status register to transition from 0 to 1 (indicating the end of the active area), and then counted cycles until the next transition from 0 to 1. On MEMU, a real MTX, or REMEMOTECH, the answer is consistently the same. On MTXplus+, about one third of the time, the answer is double the expected value. This doesn't happen when running at 4MHz or 7.112MHz, but does happen at 8MHz. This implies that we "miss" a transition. The V9938 spec says that when the status register is read, bit 7 is cleared. I currently suspect that somehow a single I/O read issued by the Z80 glitches and the V9958 interprets it as two I/O reads - the first has the side effect of clearing bit 7, and the second returns a value with bit 7 cleared. Or, alternatively, because the Z80 is running faster than it was in the MSX2, the read doesn't work properly. This anomaly remains unexplained.
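The double-read suspicion can be illustrated with a toy model. This is only a sketch of the hypothesis, not a model of the real V9958; the class and function names are invented for the example :-

```python
class StatusRegister:
    """Toy status register whose bit 7 is cleared as a side effect of reading."""

    def __init__(self):
        self.frame_flag = False

    def end_of_active_area(self):
        # The VDP sets bit 7 at the end of the active display area.
        self.frame_flag = True

    def read(self):
        value = 0x80 if self.frame_flag else 0x00
        self.frame_flag = False  # reading clears bit 7
        return value


def glitched_read(status):
    """Hypothetical glitch: one CPU read is seen by the VDP as two reads."""
    status.read()         # first read consumes the flag
    return status.read()  # the CPU is handed the second value


normal = StatusRegister()
normal.end_of_active_area()
print("normal read  :", hex(normal.read()))

glitchy = StatusRegister()
glitchy.end_of_active_area()
print("glitched read:", hex(glitched_read(glitchy)))
```

In this model the glitched read never sees bit 7 set, so the polling loop would wait for the following frame, which matches the doubled cycle count.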

The solution

First, we only support the Z80, not the Z180 variant (at least at present). To support Z180 would require additional delays in the code.

Next, we run the Z80 at 7.112MHz rather than 4MHz for these reasons :-

The V9958 is put in PAL mode (ie: 50Hz).

The V9958 WTE feature is enabled. This means that the code need never include delays between VRAM writes.

When MTXplus+ starts, it may have been started from "Colour CP/M" mode. HEXTRAIN therefore has to reset the high part of the sprite address register and also reprograms the palette to the 9929A compatible defaults.

The HEXTRAIN "compiler" compiles the huge data file differently. It writes the file as htplus.bin rather than ht.bin. It never emits delays between VRAM writes. Whenever it outputs a VDP address, it needs a big delay beforehand, must write both the low and high address bytes, and needs a big delay afterwards. If this code were emitted inline in the generated data file, it would bloat it beyond the ~12KB maximum allowed. Instead the following is emitted in the code stream :-

RST 30H ; 11T

and the following is emitted in the data stream (where HL points) :-

DB vdp_addr_low
DB vdp_addr_high OR 040H

This has the nice side effect of shrinking the size of the generated .bin file.
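To make the encoding concrete, here is a small Python sketch of the bytes involved. It is not the actual HEXTRAIN compiler; the function name is hypothetical, and it assumes a 14-bit VRAM address as on the 9918A/9929A :-

```python
RST_30H = 0xF7  # Z80 opcode for RST 30H, the single byte in the code stream


def vdp_write_address_bytes(vram_addr):
    """Return the two data-stream bytes for a VRAM write address.

    Low byte first, then the high byte with 040H ORed in to mark a write.
    Assumes a 14-bit address.
    """
    low = vram_addr & 0xFF
    high = ((vram_addr >> 8) & 0x3F) | 0x40
    return bytes([low, high])


data = vdp_write_address_bytes(0x1800)
print("code stream:", f"{RST_30H:02X}")
print("data stream:", data.hex().upper())
```

So each address change costs one byte of code and two bytes of data.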

The HEXTRAIN runtime has cleverly written the following code at 00030H (and luckily it is safe to do this) :-

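LD C,2 ;  7T, C=2, the port the VDP address is written to
OUTI   ; 16T, output address low byte from (HL), then HL=HL+1
OUTI   ; 16T, output address high byte from (HL), then HL=HL+1
DEC C  ;  4T, now C=1, ready for the VRAM writes that follow
RET    ; 10T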

This provides the following timings :-

11+7+16   = 34T@7.112MHz > 4.500us between VRAM write and address write
            16T@7.112MHz > 2.000us between address low and high writes
4+10+11   = 25T@7.112MHz > 3.125us between address write and VRAM write

Notice in the above I am considering the write to happen towards the end of the OUT instruction. So it's 11+7+16 for the first one, and it's 4+10+11 for the last (assuming the RET is immediately followed by an OUT instruction).
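The figures in the table can be double-checked by converting T-states to microseconds at 7.112MHz. This is a quick Python sketch; it takes the right-hand values in the table as the minimum gaps required, and the names are only for illustration :-

```python
CLOCK_MHZ = 7.112


def tstates_to_us(tstates, clock_mhz=CLOCK_MHZ):
    """Convert Z80 T-states to microseconds at the given clock."""
    return tstates / clock_mhz


# (description, T-states from the table, minimum gap in us from the table)
GAPS = [
    ("VRAM write to address low", 11 + 7 + 16, 4.500),
    ("address low to address high", 16, 2.000),
    ("address high to VRAM write", 4 + 10 + 11, 3.125),
]

for label, tstates, required_us in GAPS:
    actual_us = tstates_to_us(tstates)
    status = "ok" if actual_us > required_us else "TOO SHORT"
    print(f"{label}: {tstates}T = {actual_us:.3f}us, "
          f"needs > {required_us:.3f}us, {status}")
```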

I observe that for the mix of instructions in a typical screen update, the WTE mechanism results in the actual number of cycles being roughly double the number expected from the generated instructions. This isn't that surprising - as Martin puts it: we're running at twice the speed, but the V9958 on average isn't any faster than the 9918A/9929A. HEXTRAIN reads the T-cycle count for a given screen update from the binary file, so the MTXplus+ version doubles the number.

The runtime binary has an estimate for the time it takes to read the data from disk. For Silicon Disc or SD Card, a guess of 32T/byte is used, but for Compact Flash I have less efficient/optimized code. In particular, it does N single-block reads rather than one N-block read. So a closer estimate of 64T/byte is used instead.
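The estimate amounts to a simple multiplication. The sketch below shows the idea in Python; the 12KB transfer size is just an example value, and the names are not taken from the HEXTRAIN source :-

```python
CLOCK_MHZ = 7.112

# Per-byte read cost guesses quoted above, in T-states.
T_PER_BYTE = {
    "Silicon Disc": 32,
    "SD Card": 32,
    "Compact Flash": 64,
}


def estimate_read_tstates(num_bytes, device):
    """Estimate the T-states needed to read num_bytes from the given device."""
    return num_bytes * T_PER_BYTE[device]


example_bytes = 12 * 1024  # example size only
for device in T_PER_BYTE:
    tstates = estimate_read_tstates(example_bytes, device)
    millis = tstates / (CLOCK_MHZ * 1000.0)
    print(f"{device}: {tstates}T, about {millis:.1f}ms at {CLOCK_MHZ}MHz")
```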

The last difference is addition of a + character in the game title.

The resulting runtime binary is called HTCFPLUS.COM instead of HTCF.COM.

Gameplay of the MTXplus+ version should be identical to the normal version and the saved game files are interchangeable.

Links

Thanks

Finally, it was only possible to get HEXTRAIN running on MTXplus+ because Martin sent me the MTXplus+ prototype and, through a sequence of 30 or so emails and some tests, helped me understand what's going on.

{{{ Andy