qsort vs. std::sort

When writing high performance code, there are pretty well two options: C and C++.

  • C is essentially a portable wrapper around assembly. You can tell exactly how much every action costs, where memory is allocated, and where it is freed. Nothing is hidden. However, because of this direct control, writing C code is very tedious and error prone.
  • C++ is a bit higher level. You give up some control in exchange for ergonomics. For example, returning a vector from a function is the difference between an O(1) and O(n) operation, depending on if the compiler implements return value optimization. However, C++ templates give the compiler more type information, which it can sometimes use to generate faster code.

A classic example of when C++ outperforms C is sorting. In item 46 of his book Effective STL, Scott Meyers claims that std::sort can outperform qsort by up to 670% because C++ will inline the comparison function during template instantiation. Let's see how they compare with sorting a vector of a million random integers. Writing and measuring accurate benchmarks is difficult, so we're going to let the Nonius framework do it for us.

// sort.cpp
#define NONIUS_RUNNER
#include "nonius.h++"
#include <algorithm>
#include <cstdlib>
#include <random>

// qsort
static int comp(const void* const a, const void* const b) {
    const int arg1 = *static_cast<const int*>(a);
    const int arg2 = *static_cast<const int*>(b);

    // we can't simply return a - b, because that might under/overflow
    return (arg1 > arg2) - (arg1 < arg2);
}

// std::sort with no inlining
struct compare_noinline {
    __attribute__((noinline))
    bool operator()(const int a, const int b) const {
        return a < b;
    }
};

// std::sort with inlining
struct compare {
    // the compiler will inline this automatically
    bool operator()(const int a, const int b) const {
        return a < b;
    }
};

static std::vector<int> random_vector(const size_t size) {

    std::random_device seed{};
    std::default_random_engine engine{seed()};
    std::uniform_int_distribution<int> dist{std::numeric_limits<int>::min(),
                                            std::numeric_limits<int>::max()};

    std::vector<int> vec(size);
    std::generate(vec.begin(), vec.end(), [&]{ return dist(engine); });

    return vec;
}

// generate a vector of a million random integers
constexpr size_t size = 1'000'000;
static const auto rand_vec = random_vector(size);

NONIUS_BENCHMARK("qsort", [](nonius::chronometer meter) {

    // Nonius does multiple runs of the benchmark, and each one needs a new
    // copy of the original vector, otherwise we'd just be sorting the same
    // one over and over
    const auto runs = static_cast<size_t>(meter.runs());
    std::vector<std::vector<int>> vectors{runs, rand_vec};

    meter.measure([&](const size_t run) {
        auto& current_vec = vectors[run];
        std::qsort(current_vec.data(), current_vec.size(), sizeof(int), comp);
    });
});

NONIUS_BENCHMARK("std::sort noinline", [](nonius::chronometer meter) {

    const auto runs = static_cast<size_t>(meter.runs());
    std::vector<std::vector<int>> vectors{runs, rand_vec};

    meter.measure([&](const size_t run) {
        auto& current_vec = vectors[run];
        std::sort(current_vec.begin(), current_vec.end(), compare_noinline{});
    });
});

NONIUS_BENCHMARK("std::sort inline", [](nonius::chronometer meter) {

    const auto runs = static_cast<size_t>(meter.runs());
    std::vector<std::vector<int>> vectors{runs, rand_vec};

    meter.measure([&](const size_t run) {
        auto& current_vec = vectors[run];
        std::sort(current_vec.begin(), current_vec.end(), compare{});
    });
});

Let's compile this bad boy.

# Apple Clang 8.0.0
# Note that Nonius has a dependency on Boost
$ clang++ -std=c++14 -stdlib=libc++ -O3 -march=native sort.cpp -o sort
$ ./sort

Nonius prints out quite a lot of information, so I'll summarize the most important bits.

qsort                253 ms +/- 8 ms
std::sort noinline   137 ms +/- 6 ms
std::sort inline      88 ms +/- 5 ms

With no inlining, std::sort is already 1.8x faster than qsort. I am unsure why this is, it could be from differences in algorithms. Inlining bumps this up to 2.9x faster: an impressive speedup.