I’ve just read the article Use Thread-local Storage to Reduce Synchronization from the Intel Guide for Developing Multithreaded Applications and here is my take on it.
Indeed. If threads are a tool for concurrency, synchronization is a tool for suppressing concurrency. Any form of synchronization (no matter locks, semaphores or atomic operations) hinders concurrency. It requires a distributed system to achieve strong global consensus, and consensus in a distributed system can’t be cheap. Period.
So the best design of a concurrent system tries to reduce synchronization to a bare minimum (total elimination of synchronization is impossible, otherwise the system will break up into several independent systems). There are various techniques for reducing the need for synchronization – partitioning, privatization, replication, amortization.
The idea of partitioning is to split whole data-set into several mostly independent partitions, a worker thread is bound to each of the partitions, plus there must be some partitioning function that maps an external key to a partition where the data resides. Then, all requests are routed directly to a thread that bound to the required partition. As a result, a worker thread works with the partition’s data without any synchronization.
Privatization is a private case of the partitioning with a single partition, i.e. whole data-set is handed over to a single thread which can work with it without any synchronization. The negative side of this technique is that the single thread can become a bottleneck, other threads concurrently running on other cores can overwhelm it with requests.
The idea of replication is to have several independent replicas of a data-set, and propagate updates between replicas explicitly via messages. Data in replicas can be temporary inconsistent, however a lot of systems can tolerate some inconsistency.
Amortization is usually based on some form of thread-local data (placed either on a thread’s stack or in a compiler/OS-provided storage). The idea is simple – we collect some updates in thread-local storage and then apply them later in batches. That’s what we saw in the article. The main advantage of amortization based on thread-local storage is it’s simplicity. Indeed, you do not need to reorganize your data, to route requests to particular threads based on data placement, cope with inconsistencies, etc. So, if it’s applicable it’s the first thing you must consider.
Well, there are too many things I can say on these things… a way too many to fit into this blog. But what I want to communicate is that you must consider these things as a starting point rather than a final destination, they are a primitive tools for reducing synchronization in your concurrency toolbox. Choose the best tool for a particular situation, combine them, adopt them.
Now a few comments directly on the article.
This solution trades synchronization per event for synchronization per thread. Performance will improve if the number of events is much larger than the number of threads.
I would not agree here, there is no such a tradeoff involved. If a thread had not collected any events in his thread-local storage, then he just does not access centralized data at all. The additional overhead is a single ‘if’ statement per thread, which is negligible in a context of inter-thread work distribution. This technique does not increase the total number of events.
An additional advantage of using thread-local storage during time-critical portions of the program is that the data may stay live in a processor’s cache longer than shared data, if the processors do not share a data cache. When the same address exists in the data cache of several processors and is written by one of them, it must be invalidated in the caches of all other processors, causing it to be re-fetched from memory when the other processors access it. But thread-local data will never be written by any other processors than the one it is local to and will therefore be more likely to remain in the cache of its processor.
In general this is very true. Indeed, thread-local data reduces amount of inter-core communication, thus reducing amount of costly cache-coherence traffic.
But, this has little to do with shared cashes, even if cores share L3 cache, data still will be transferred between their L1 caches (L1 caches are not shared between cores on most current processors). So I would recommend to just ignore the part on shared caches. Prefer thread-local data and you are on the safe side with any current or future architecture.
There is another important consideration with regard to shared L2/L3 caches (which are featured on many current processors), and this consideration is against thread-local data. Consider the following situation. Moderate size shared object is frequently accessed for reading, but infrequently for writing. If it is split into thread-local parts (which usually implies increase in size), it will not fit into shared L2/L3 cache, thus threads will constantly evict each others data from the cache. However, if the object is implemented as a single centralized entity, it fits into the cache, thus threads will work with cached data without evictions.
So, the tradeoff frequently involved is reduction of synchronization versus increase of total working set. Which to prefer is highly dependent on the situation.
One must be careful about the trade-offs involved in this technique. The technique does not remove the need for synchronization, but only moves the synchronization from a time-critical section of the code to a non-time-critical section of the code.
Well, I would say that the main point of the technique is reduction of synchronization rather than move of the synchronization from one part of the code to another. Amortization via thread-local storage can not involve any movement of synchronization at all. The technique can be applied in two forms: single final aggregation or periodic aggregations. The latter does not involve any movement of synchronization while still reduces synchronization overheads. And it has additional benefit that separate monitoring thread can periodically fetch and output intermediate results.
Consider, for example, the following program:
long total_event_count;
__declspec(thread) long thread_event_count; // thread-local cache
void thread_function(size_t begin, size_t end)
{
for (size_t i = begin; i != end; i += 1)
{
if (predicate(i))
{
thread_event_count += 1;
// if we have cached enough events,
// transfer them to global shared variable
if (thread_event_count == THRESHOLD)
{
_InterlockedExchangeAdd(&total_event_count, thread_event_count);
thread_event_count = 0;
}
}
}
// transfer the remainder of locally cached events
if (thread_event_count)
_InterlockedExchangeAdd(& total_event_count, thread_event_count);
}
In the above example the synchronization is not moved to another point, but it’s still reduced by a factor of THRESHOLD. Separate monitoring thread can periodically read and output total_event_count variable, and there is a guarantee that total_event_count does not lag behind real value of discovered events by more than NUMBER_OF_THREAD * THRESHOLD.
Note that thread-local data may be actually shared between threads, there is nothing preventing this. A method of declaration of a variable is orthogonal to it’s “shared-ness”. Address of a variable declared as __declspec(thread)/__thread/omp threadprivate/pthread_key_create()/TlsAlloc() can be passed to another thread, and thus the variable become shared. Just as plain global variable can be ever accessed by a single thread, and so it’s local to the thread.
Also note that you can get a flavor of thread-local data with plain global array indexed by a unique thread index. This technique is less dependent on a particular compiler/OS, and makes sharing of thread-local data much easier (infrequent sharing is not dangerous and anyway necessary in any real-world program). Here is a simple example:
// array of "thread-local" data
long volatile event_counts [MAX_THREAD_COUNT] = {};
// sequence used to generate unique thread indexes
long volatile thread_sequence = 0;
// worker thread routine
void worker_thread(size_t begin, size_t end)
{
// obtain unique thread index
long my_idx = _InterlockedIncrement(&thread_sequence) - 1;
for (size_t i = begin; i != end; i += 1)
{
if (predicate(i))
event_counts[my_idx] += 1;
}
}
// monitoring thread routine
void monitor_thread()
{
while (termination_condition == false)
{
// obtain current thread count
long thread_count = thread_sequence;
long sum = 0;
for (long i = 0; i != thread_count; i += 1)
sum += event_counts[i];
printf("event count: %u\n", (unsigned)sum);
Sleep(1000);
}
}
However, be aware that the above example contains a nasty instance of false-sharing which kills performance. You can read about how to cope with it in the article Avoiding and Identifying False Sharing Among Threads.
Keep threading!
. Read the rest at Intel.com.