Hi Graham,
You're on the right path. Many of the extra cycles you're seeing come from the overhead of the Hwi dispatcher. Hwi_plug gets rid of all that by plugging your ISR directly into the interrupt vector, so you should see even fewer than 91 cycles. Here's sample code that uses Hwi_plug.
(Please visit the site to view this file)
(Please visit the site to view this file)
Let me know if this helps.
Moses