# Evaluating and Comparing BER Performance

Submitted by winstead on Thu, 11/28/2013 - 08:48
One of my pet peeves with error correction research is that people frequently omit important details about bit error rate (BER) measurements. The BER performance is used to evaluate the quality of an algorithm, and to compare it against other competing algorithms. But sometimes two algorithms are very close in performance, and it is tempting to say, "my algorithm is best," even if it is only better by a small fraction. This can be problematic because BER is a statistical estimate, and the precise value is always a little uncertain. Even though one BER result looks lower than another, it might be just a lucky accident. BER values can also be distorted by the way random numbers are generated, or by the way simulations are conducted. In this post, I walk through some examples and give recommendations for correctly measuring, comparing and reporting BER estimates. As a first example, let's look at BER plots for two algorithms that are very close in performance:
BER plot for two hypothetical decoders.
Based on this data, it may be tempting to conclude that A is better than B (ignoring the 2.5dB point as an anomaly). But this graph doesn't tell the whole story. We might notice that the curve for Method B looks a little odd. It isn't smooth, which could indicate that more trials are needed to properly measure its performance. Before we can make proper comparisons, we need to understand the measurements' confidence intervals, and the possible distortions that may arise from simulation methods. Confidence Intervals At the frame level, decoder simulation is a Bernouli process in which each trial ends in success or failure. The total error count $N_E$ is obtained by summing over the failed outcomes, and therefore follows a binomial distribution. The error probability is estimated by $\hat{p_e} = \frac{N_E}{N},$ where $N$ is the total number of frames simulated. Since $N_E$ is a random variable, the estimate $\hat{p_e}$ is also a random variable, and its value is not certain. The range of uncertainty in $\hat{p_e}$ is quantified by the binomial proportion confidence interval, which indicates the range which most probably contains the true value of the frame error rate $p_e$. We usually speak of the "95% confidence interval" or the "99% confidence interval," which means that if we repeat the measurement 100 times, then we expect 95 or 99 of the outcomes, respectively, to fall within the predicted interval. The confidence interval itself is an easy calculation: $p_e \in \hat{p_e} \pm z\sqrt{\frac{1}{N}\hat{p_e}\left(1-\hat{p_e}\right)},$ where $z$ is a parameter that depends on the degree of confidence. For 95% confidence, $z=1.96$ and for 99% confidence, $z=2.58$. Now, we may repeat the above BER plot showing the 99% confidence intervals:
BER plot of two simimlar decoders, with confidence intervals.
Here we see that the curves' confidence intervals overlap at most points. That means that the $N$ isn't large enough to meaningfully resolve differences between the two curves. Note 1: In the discussion of confidence intervals, I spoke about frame error rates, but then I presented plots showing bit error rates. The frame error rate follows a true binomial distribution, and over a sufficiently large number of frames, the BER may follow an approximate binomial distribution, but this is not guaranteed since bit errors are not independent within a frame. It is therefore safer to evaluate confidence intervals for the FER. A recent article by Mazzeo and Rice (our neighbors at BYU) provides a detailed examination of confidence intervals for BER measurements:
• B. Mazzeo and M. Rice, "On Monte Carlo Simulation of the Bit Error Rate," Communications (ICC), 2011 IEEE International Conference on , vol., no., pp.1,5, 5-9 June 2011. [Xplore]
They recommend using an estimator based on the negative binomial distribution rather than the binomial approximation described here. I recommend studying their article and using the negative binomial method for estimating BER confidence intervals. Note 2: Many papers make claims about coding gain measured from BER or FER curves. In error correction theory, there is an analytical definition for the asymptotic coding gain, which predicts the SNR at which the error rate goes to zero. In a real decoder, we cannot measure "zero," so we might simply find the SNR for which the error rate is suitably low, e.g. ${\rm BER} < 10^{-6}$. From this perspective, we may define the empirical coding gain of one decoder in comparison to another as the difference between their SNR values for they respectively achieve a BER of $10^{-6}$. There are two big difficulties with this definition of coding gain: The first problem is that the BER measurement is inexact; therefore the SNR measurement for $10^{-6}$ is inexact. The second problem is that the BER curve is not a simple linear function of SNR, so it is not trivial to translate BER confidence intervals into SNR confidence intervals (and eventually into coding gain confidence intervals). It therefore seems to me that empirical coding gain is a really imprecise measure for comparing two decoders, and I don't mention it unless the difference is really big (greater than 0.25dB). Simulation Methodology The confidence interval immediately reveals ways that our simulation results might be distorted. Suppose I perform an "optimization" in which I repeatedly simulate a decoder, saving the best BER measurement. If I repeat the simulation more than 100 times, then it is possible that my "best" BER value is actually better than the lowest point on the 99% confidence interval. This is a form of cheating. If I really want to distort the results, I can simulate Method A and Method B, filtering their BERs so that I always record the worst measurement for A and the best measurement for B. That's another cheat. In order to avoid cheating, here are some recommendations for obtaining believable results:
1. After performing repeated simulations to optimize some parameters, perform another simulation to confirm the performance using the chosen parameters. In your final simulation, make sure you have a large $N$ and continue the simulation until $N_E > 150$. In this simulation, use a new seed for your random number generator so that you are not recycling old input frames.
2. For BER optimization and verification, do not use a Matlab or Octave model. Matlab is too slow for massive Monte Carlo simulations. If you want to use Matlab, then code your algorithm in C as a MEX function. Otherwise you can write your whole simulation in C/C++. I prefer using SystemC because it allows representing the decoder's physical architecture, it handles concurrent processing in a physically realistic way, and it facilitates "cycle-accurate" and "bit-true" simulations of decoder performance. In any case, you need to take the extra time to write your algorithm in C or C++.
3. If comparing two algorithms or architectures, the best approach is to simulate both competitors for the exact same input patterns. This means you should (ideally) include all models in your simulation so that they are compared under exactly identical conditions, receiving the same codewords with the same noise patterns. (This isn't always possible, but is recommended).
4. If comparing data from published literature, and the publication does not specify $N$ or $N_E$, then be cautious about drawing conclusions. If your algorithm is only a little better, then the improvement might be a statistical accident.
5. For final publication-grade simulations, use a good random number generator. Matlab provides good random numbers, but C/C++ libraries are not guaranteed to give quality results. I recommend studying the Numerical Recipes chapter on random number generators. Alternatively, you can use random number engines from standard libraries like the GNU Scientific Library (GSL) or the IT++ library of classes and functions for communications and signal processing.
These are suggestions, not absolute rules. There are always situational tradeoffs. For example, if you want to make your simulation code public (which I recommend), then you may want to minimize usage of specialized libraries like GSL, IT++ or SystemC (you can use them, but each library you add will limit the potential audience for your code).