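If you are using a Cortex-A8, then you can optimize your code with the NEON SIMD (single instruction, multiple data) instruction set. My guess is that you can speed up your code by up to 83%, or maybe more, though the actual gain depends on how well your workload vectorizes.
You just need to dive into the NEON instruction set, either in assembly or through the intrinsics in arm_neon.h.
You can keep the code portable with a compile-time guard like this: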
    /// At the beginning of the file:
    #ifdef __ARM_NEON__
    #include <arm_neon.h>
    #endif

    /// In the main function space:
    /**
     * @brief This function multiplies two floats.
     *
     * This function is optimized for the ARM NEON instruction set. However
     * a standard C fallback version is present as well (e.g. for x86 systems).
     *
     * @param f1 The first float number
     * @param f2 The second float number
     * @return The two floats multiplied
     **/
    float optimized_function(float f1, float f2)
    {
    #ifdef __ARM_NEON__
        /// ARM NEON code implementation
        /// (illustrative only: a single multiply gains nothing from SIMD)
        float32x2_t a = vdup_n_f32(f1);          // broadcast f1 to both lanes
        float32x2_t b = vdup_n_f32(f2);          // broadcast f2 to both lanes
        return vget_lane_f32(vmul_f32(a, b), 0); // multiply, read lane 0
    #else
        /// Standard implementation
        return f1 * f2;
    #endif
    }
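
NEON only pays off when you process several values per instruction, so here is a rough sketch of the same guard pattern applied to whole arrays. The function name `multiply_arrays` and the `HAVE_NEON` macro are made up for this example; the intrinsics (`vld1q_f32`, `vmulq_f32`, `vst1q_f32`) come from arm_neon.h. It also checks `__ARM_NEON`, the newer spelling of the feature macro, next to `__ARM_NEON__`. Treat it as a starting point under those assumptions, not a tuned implementation.

```c
#include <stddef.h>

/* HAVE_NEON is a hypothetical helper macro for this sketch. */
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: out[i] = a[i] * b[i] for every i in [0, n). */
void multiply_arrays(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;

#if HAVE_NEON
    /* Process four floats per iteration in 128-bit Q registers. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);     /* load 4 floats from a */
        float32x4_t vb = vld1q_f32(b + i);     /* load 4 floats from b */
        vst1q_f32(out + i, vmulq_f32(va, vb)); /* multiply and store   */
    }
#endif

    /* Scalar loop: handles the leftover tail on NEON builds and
       the whole array on targets without NEON (e.g. x86). */
    for (; i < n; ++i) {
        out[i] = a[i] * b[i];
    }
}
```

With GCC on a Cortex-A8 you would typically need to enable NEON explicitly (for example with `-mfpu=neon`), otherwise the feature macro is not defined and the scalar path is compiled instead. Measure both paths on your own data before relying on any speedup figure.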