Sunday, April 24, 2016

Windows: 32-bit, 64-bit, 128-bit?

The switch from 16 to 32 bit made perfect sense: the memory requirements of many applications were already above the limit of the 16-bit address space (2^16 = 64 KB), and the extra address space (2^32 = 4 GB) was aligned with the physical memory range of most mainstream computers.

In contrast, the switch from 32 to 64 bit has proven beneficial to far fewer applications. The benefits of the wider registers are limited to applications that do heavy math, such as encryption software and the like. And while the larger address space (up to 2^64 = 16 exabytes!) is useful for applications that benefit from more than 3-4 GB (the Xbox One, for example, uses 8 GB), it also comes at the expense of data locality (the processor cache didn't grow with the address space) and scalability.

Since 64 bit already pushes the limits of practical processors, switching to 128 bit and introducing yet another WoW (Windows on Windows) compatibility layer (on top of the existing WoW64 subsystem) makes even less sense, especially for the desktop.

Note that a decade after 64-bit computing arrived on the desktop, extra-large applications like Outlook and Visual Studio still fit, and are better off, in a 32-bit address space.

The Visual Studio team has been having the same conversation about taking the leap to 64 bit for the past 10 years, and the jury is still out. Outlook ships a 64-bit version, but only to allow 64-bit extensions, not for the benefit of the extra RAM. The 64-bit version of the Edge browser is installed by default on 64-bit hardware only for security reasons; at its core it is optimized for data locality and makes very economical use of memory (Edge's main process is 1/4 the size of IE's main process and 1/2 the size of Chrome's main process).

The fact of the matter is that when it comes to space, less is more. Applications that use less memory (without increased use of virtual memory / swapping to disk) run faster and scale better (physical RAM in mainstream computers/tablets ranges somewhere between 1 and 24 GB).

Perhaps truly groundbreaking innovations in hardware and power consumption (followed by computing) will make it possible to use that space to process the entire World Wide Web's data from a single computer running Windows... If that ever happens, I bet that the 128-bit bus will be used to access a component completely different from a silicon memory chip.

If disk is the new tape and memory is the new disk, this 128-bit-capable beast would be the new memory.

Saturday, April 9, 2016

Scale up and scale out with .NET framework

While the scalability of an application is mostly determined by the way in which the code is written, the framework / platform that is being used can significantly influence the amount of effort required to produce an application that scales gracefully to many cores (scale up) and many machines (scale out).

Before joining Microsoft, I was part of a team that built a distributed, mission-critical Command and Control system using .NET technologies (almost exclusively). The applications that make up the system are deployed on powerful servers; they are all stateful and massively concurrent, with strict throughput requirements. We were able to build a quality, maintainable, production-ready system in less than 3 years (a record time for such a system).

During my past 6 years at Microsoft, I have worked on hyper-scale services deployed on thousands of machines, all of which were written using .NET technologies.

Here’s how .NET empowers applications that need to scale up and scale out.

Scale up:

Building a reliable, high-performance application that scales gracefully to many-core hardware is hard to do. One of the challenges associated with scalability is finding the most efficient way to control the concurrency of the application. Practically, we need to figure out a way to divide the work and distribute it among threads such that we put the maximum number of cores to work. Another challenge that many developers struggle with is synchronizing access to shared mutable state (to avoid data corruption, race conditions, deadlocks and the like) while minimizing contention between threads.

Concurrency Control

Take the concurrency/throughput analysis below, for example: note how throughput peaks at a concurrency level of 20 (threads) and degrades once the concurrency level exceeds 25.

So how can a framework help you maximize throughput and improve the scalability characteristics of your application?

It can make concurrency control dead simple. It can include tools that allow you to visualize, debug and reason about your concurrent code. It can offer first-class language support for writing asynchronous code. It can include best-in-class synchronization and coordination primitives and collections.

In the past 13 years I’ve been following the incremental progress that the developer division has made in this space - it has been a fantastic journey.

In the first version of .NET, the managed Thread Pool was introduced to provide a convenient way to run asynchronous work. The Thread Pool optimizes the creation and destruction of threads (according to Joe Duffy, it costs ~200,000 cycles to create a thread) through the use of a heuristic thread injection and retirement algorithm that determines the optimal number of threads by looking at the machine architecture, the rate of incoming work and the current CPU utilization.
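For illustration, here is a minimal sketch (the names and the work item count are mine) of queuing work to the managed Thread Pool and letting it decide how many threads to use:

using System;
using System.Threading;

class ThreadPoolDemo
{
    static void Main()
    {
        using (var done = new CountdownEvent(10))
        {
            for (int i = 0; i < 10; i++)
            {
                int id = i; // capture a copy for the closure
                ThreadPool.QueueUserWorkItem(_ =>
                {
                    // Runs on a pooled thread; the pool decides how many threads to inject
                    Console.WriteLine("Work item {0} on pooled thread {1}", id, Thread.CurrentThread.ManagedThreadId);
                    done.Signal();
                });
            }
            done.Wait(); // block until all queued work items have completed
        }
    }
}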

In .NET 4.0, the TPL (Task Parallel Library) was introduced. The Task Parallel Library includes many features that enable an application to scale better: it gives worker threads their own local queues (to reduce contention on the global queue) with work-stealing capabilities, and it supports tuning the concurrency level (setting the number of tasks that are allowed to run in parallel).
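A small sketch of what that looks like in practice: Parallel.ForEach partitions the input across worker threads, and the MaxDegreeOfParallelism option caps the concurrency level (the input collection and the cap of 8 are illustrative):

using System.Linq;
using System.Threading.Tasks;

class TplDemo
{
    static void Main()
    {
        int[] items = Enumerable.Range(0, 1000).ToArray();

        // Cap the concurrency level; the TPL partitions the work across worker
        // threads, each with its own local queue and work stealing.
        var options = new ParallelOptions { MaxDegreeOfParallelism = 8 };

        Parallel.ForEach(items, options, item =>
        {
            DoCpuBoundWork(item); // placeholder for per-item work
        });
    }

    static void DoCpuBoundWork(int item) { /* ... */ }
}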

In .NET 4.5, the async-await keywords were introduced, making asynchronicity a first-class language feature and using compiler magic to make code that looks synchronous run asynchronously. As a result, we get all the advantages of asynchronous programming with a fraction of the effort.
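A minimal example of the pattern (the URL and method name are made up for illustration); the method reads top-to-bottom like synchronous code, but no thread is blocked while the I/O is in flight:

using System;
using System.Net.Http;
using System.Threading.Tasks;

class AsyncDemo
{
    // The compiler rewrites this method into a state machine; the thread is
    // released back to the pool at the await point.
    static async Task<int> GetPageLengthAsync(string url)
    {
        using (var client = new HttpClient())
        {
            string body = await client.GetStringAsync(url);
            return body.Length;
        }
    }

    static void Main()
    {
        int length = GetPageLengthAsync("http://example.com").GetAwaiter().GetResult();
        Console.WriteLine(length);
    }
}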

Consistency / Synchronization

Although more and more code is now written to run in parallel, protecting shared mutable data from concurrent access (without killing scalability) is still a huge challenge. Some applications can get away relatively easily by sharing only immutable objects, or by using lock-free synchronization and coordination primitives and collections (e.g. ConcurrentDictionary), which eliminate the need for locks almost entirely. However, to achieve greater scalability, there is no escape from using fine-grained locks.
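For example, here is a brief sketch of sharing a mutable map between threads via ConcurrentDictionary, without any explicit locks (the word-counting scenario is just for illustration):

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class ConcurrentDemo
{
    static void Main()
    {
        var counts = new ConcurrentDictionary<string, int>();
        string[] words = { "scale", "up", "scale", "out", "scale" };

        Parallel.ForEach(words, word =>
        {
            // AddOrUpdate is atomic per key; no explicit lock around the shared map
            counts.AddOrUpdate(word, 1, (key, current) => current + 1);
        });

        foreach (var pair in counts)
            Console.WriteLine("{0}: {1}", pair.Key, pair.Value);
    }
}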

In an attempt to provide a solution for mutable in-memory data sharing that on one hand scales, and on the other hand is easy to use and less error prone than fine-grained locks, the team worked on Software Transactional Memory (STM) support for .NET that would have eased the tension between lock granularity and concurrency. With STM, instead of using multiple locks of various kinds to synchronize access to shared objects, you simply wrap all the code that accesses those objects in a transaction and let the runtime execute it atomically and in isolation, doing the appropriate synchronization behind the scenes. Unfortunately, this project never materialized.
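To give a feel for the idea, here is a hypothetical sketch contrasting today's fine-grained locks with what an STM-style API might have looked like; Atomic.Do is an illustrative name, not a shipped .NET API:

using System;

class Account
{
    public decimal Balance;
    public readonly object Gate = new object();
}

class StmSketch
{
    // Today: correctness depends on taking fine-grained locks in a consistent
    // order (two transfers in opposite directions can otherwise deadlock).
    static void TransferWithLocks(Account from, Account to, decimal amount)
    {
        lock (from.Gate)
        lock (to.Gate)
        {
            from.Balance -= amount;
            to.Balance += amount;
        }
    }

    // With STM the same code would simply be wrapped in a transaction, and the
    // runtime would synchronize (and retry on conflict) behind the scenes.
    // Atomic.Do is hypothetical -- the project never shipped.
    //
    // static void TransferWithStm(Account from, Account to, decimal amount)
    // {
    //     Atomic.Do(() =>
    //     {
    //         from.Balance -= amount;
    //         to.Balance += amount;
    //     });
    // }
}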

As far as I know, the .NET team was the only one that even made a serious effort to make fine-grained concurrency simpler to use in a non-functional language.

Speaking of functional languages, F# is a great choice for building massively concurrent applications. Since F# structures are immutable by default, sharing state and avoiding locks is much easier. F# also integrates seamlessly with the .NET ecosystem, which gives you access to all the third-party .NET libraries and tools (including the TPL).

Scale out:

Say you are building a stateless website/service that needs to scale to support millions of users: you can deploy your ASP.NET application as a Web Site or Cloud Service to the Microsoft Azure public cloud (and soon to Microsoft Azure Stack for on-premises) and run it on thousands of machines. You get automatic load balancing inside the data center, and you can use Traffic Manager to load balance requests across data centers. All with very little effort.

If you are building a stateful service (or a combination of stateful and stateless), you can use Azure Service Fabric, which allows you to deploy and manage hundreds or thousands of .NET applications on a cluster of machines. You can scale your cluster up or down easily, knowing that the applications scale according to the available resources.
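As a rough sketch (the class and collection names follow the Reliable Services SDK, but treat the exact signatures as illustrative), a stateful service keeps its state in replicated reliable collections and mutates it inside transactions:

using System;
using System.Fabric;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Data.Collections;
using Microsoft.ServiceFabric.Services.Runtime;

class CounterService : StatefulService
{
    public CounterService(StatefulServiceContext context) : base(context) { }

    protected override async Task RunAsync(CancellationToken cancellationToken)
    {
        // State lives in a replicated, persisted reliable collection rather
        // than in local memory, so it survives failover and node moves.
        var counters = await StateManager.GetOrAddAsync<IReliableDictionary<string, long>>("counters");

        while (!cancellationToken.IsCancellationRequested)
        {
            using (var tx = StateManager.CreateTransaction())
            {
                await counters.AddOrUpdateAsync(tx, "requests", 1, (key, value) => value + 1);
                await tx.CommitAsync(); // replicated to a quorum of replicas before returning
            }

            await Task.Delay(TimeSpan.FromSeconds(1), cancellationToken);
        }
    }
}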

Note that you can use the above with non-.NET applications as well, but most of the tooling and libraries are optimized for .NET.