Optimize service loop so that the starting point is the lowest-indexed
service mapped to the lcore in question, and terminate the loop at the
highest-indexed service.
While the worst case latency remains the same, this patch
significantly reduces the service framework overhead for the average
case. In particular, scenarios where an lcore only runs a single
service, or multiple services which id values are close (e.g., three
services with ids 17, 18 and 22), show significant improvements.
The worse case is a where the lcore two services mapped to it; one
with service id 0 and the other with id 63.
On a service lcore serving a single service, the service loop overhead
is reduced from ~190 core clock cycles to ~46, on an Intel Cascade
Lake generation Xeon. On weakly ordered CPUs, the gain is larger,
since the loop included load-acquire atomic operations.
Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Harry van Haaren <harry.van.haaren@intel.com>