Skouperd
27-09-2013, 09:43 AM
Hi everybody, this is part of a longer article. Please critique it in order for me to fix it before posting the final version online. Your honest opinions, crap or not, will be appreciated.
Building a fast system
I have often been asked by people buying a new computer to simply give them a "fast" computer. Over the years, I have struggled with what exactly makes a computer fast. Is it the clock speed of the CPU, the graphics card, the memory, the hard drives, or the speed at which I can access my banking details online? A gamer will argue that you need a very good graphics card paired with a CPU running at a high clock speed. Somebody working with SQL will argue that they rather need a fast hard drive solution and lots of fast memory. Somebody working in Excel and doing photo editing would like more CPU cores and perhaps a faster internet connection. Looking at the request again, "build me a fast system" suddenly becomes difficult. So before we start delving into building an uber-fast system, let us first understand how data moves around inside a computer.
Magnitude of Time
Just before we proceed with the next step, it is important to differentiate between orders of magnitude of time.
1 picosecond = 1/1,000,000,000,000 (one trillionth) of a second (light travels 1mm in 3.3 picoseconds)
1 nanosecond = 1/1,000,000,000 (one billionth) of a second (light travels just over 1 metre in 3.34 nanoseconds)
1 microsecond = 1/1,000,000 (one millionth) of a second (the average human eye blink takes 350,000 microseconds)
1 millisecond = 1/1,000 (one thousandth) of a second (light travels almost 300km in 1 millisecond)
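If you want to check those light-travel figures yourself, here is a quick sketch in C; the only input is the speed of light, about 299,792,458 m/s, and the rest is arithmetic:

#include <stdio.h>

int main(void) {
    const double c = 299792458.0;  /* speed of light in metres per second */

    /* distance light covers in each unit of time */
    printf("in 1 picosecond:  %.2f mm\n", c * 1e-12 * 1e3);  /* ~0.30 mm */
    printf("in 1 nanosecond:  %.2f m\n",  c * 1e-9);         /* ~0.30 m  */
    printf("in 1 microsecond: %.0f m\n",  c * 1e-6);         /* ~300 m   */
    printf("in 1 millisecond: %.0f km\n", c * 1e-3 / 1e3);   /* ~300 km  */
    return 0;
}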
Understanding Cache and Latency
Many years ago I read an explanation (I think it was on www.howstuffworks.com) of the difference between all the various kinds of memories and caches. I am going to use their explanation, as it is still the best I have found to date. Assume you are a librarian who has to hand books over the counter to the people coming into the library. To create an efficient system, books that are requested frequently are kept close by, say in a backpack that you carry. When somebody asks for one of them, you can flip it out and hand it over in a matter of seconds. Books that are popular, but not requested as often as those in your backpack, you may stockpile under the front counter so you can hand them over in a matter of minutes. That is not as fast as pulling one from your backpack, but still a lot faster than searching the shelves. If a customer asks for a book that is rarely requested, you have to go and find it on one of the shelves, which may take hours. And for a very rare request, you may have to order the book in from the state library, which could take days.
The backpack is the smallest place you can store books in, but it is also the fastest to retrieve them from. As you move outwards, the amount of storage space increases, but so does the time it takes to retrieve anything. The backpack may hold only a handful of books, the front counter tens of books, the bookshelves behind you hundreds, and the state library thousands.
Computers work the same way; there are different kinds of storage installed in your computer, and the faster a kind of storage is to access, generally the smaller its usable space. Just as we measured the speed at which you could hand a book to the customer in the example above, we can measure the time it takes for data to flow from the various points to the CPU. With a computer, however, we measure that time in nanoseconds "ns" (one billionth of a second), microseconds (one millionth of a second), milliseconds "ms" (one thousandth of a second), or seconds, rather than the seconds, minutes, hours and days of the library scenario.
We start with what is installed directly on the CPU (the backpack), called "cache". Different levels of cache exist; most modern CPUs have three. Below is a summary of three different CPUs:
CPU              e5-2670   i7-3820   i3-M390
L1 Cache (code)  256KB     128KB     64KB
L1 Cache (data)  256KB     128KB     64KB
L2 Cache         2MB       1MB       512KB
L3 Cache         20MB      10MB      3MB
Using the e5-2670 as an example, it has the following on-die cache:
Level one, also known as "L1", has 256KB for code and 256KB for data.
Level two, "L2", contains 2MB.
Level three, "L3", has 20MB.
The L1 cache is the smallest but also the fastest to access, at around 4 CPU cycles (1.2ns); L2 takes around 12 cycles (3.7ns), and L3 around 26 cycles (6.6ns). As such, accessing the CPU's cache is extremely fast.
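To convert cycle counts into time, divide by the clock frequency. A minimal sketch, assuming a clock of around 3.3GHz (an assumption; the exact figures depend on the actual clock speed, and measured L3 latency also varies with where on the die the data sits):

#include <stdio.h>

int main(void) {
    const double clock_hz = 3.3e9;  /* assumed ~3.3GHz clock; adjust for your CPU */
    const char *level[]  = {"L1", "L2", "L3"};
    const int   cycles[] = {4, 12, 26};  /* access cost in CPU cycles */

    for (int i = 0; i < 3; i++)
        printf("%s: %2d cycles = %.1f ns\n",
               level[i], cycles[i], cycles[i] / clock_hz * 1e9);
    return 0;
}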
However, if the data or code you are looking for is not found in the CPU's cache, the CPU will go to the Random Access Memory (or "RAM", or just "memory") to find it there. With 16GB of quad-channel memory installed, the CPU can obtain data from the RAM modules at a rate of 12.8GB/s, with a latency of around 65ns. Even at 65ns, RAM is still more than 50 times slower than the L1 cache, and almost 10 times slower than the L3 cache. In 65ns, light travels less than 20 metres.
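To make that cascade concrete, here is a minimal sketch using the latency figures quoted in this article: a miss at one level pushes the request on to the next, slower level, and you pay roughly the latency of the level where the data is finally found.

#include <stdio.h>

/* Approximate access latencies used in this article, in nanoseconds. */
struct level { const char *name; double latency_ns; };

static const struct level hierarchy[] = {
    {"L1 cache", 1.2},
    {"L2 cache", 3.7},
    {"L3 cache", 6.6},
    {"RAM",      65.0},
};

int main(void) {
    for (int i = 0; i < 4; i++)
        printf("data found in %-8s -> ~%5.1f ns (%2.0fx L1)\n",
               hierarchy[i].name, hierarchy[i].latency_ns,
               hierarchy[i].latency_ns / hierarchy[0].latency_ns);
    return 0;
}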
Understanding Bandwidth
In the previous section we spoke about latency, which in simple terms is nothing more than the time it takes data to move from one point to the next. This section deals with bandwidth: bandwidth, for all practical purposes, is the amount of data that can be transferred in a given amount of time. Latency asks how long it takes for the data to start arriving, while bandwidth tells you how much data you can transfer.
The easiest explanation is to think about vehicles moving from point A to point B. If I asked for the fastest car, the first reaction is to consider latency: most people, given a choice between a Ferrari and a mini-bus, will say the Ferrari is faster. However, if the objective is to move 20 people from point A to point B, then the mini-bus taxi, with its greater bandwidth, is the superior choice. Computers operate in the same manner: certain applications require very low latencies (they need to transfer only a single person), whereas others require more bandwidth (they need to transfer 20 people).
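A simple way to combine the two ideas: the time to fetch a block of data is roughly the latency (waiting for the first byte to arrive) plus the size divided by the bandwidth (streaming the rest). A minimal sketch of that model in C, using the RAM figures quoted earlier:

#include <stdio.h>

/* Rough model: total time = latency + bytes / bandwidth. */
static double transfer_time(double latency_s, double bandwidth_Bps, double bytes) {
    return latency_s + bytes / bandwidth_Bps;
}

int main(void) {
    const double ram_latency = 65e-9;   /* ~65ns, as quoted earlier */
    const double ram_bw      = 12.8e9;  /* ~12.8GB/s, as quoted earlier */

    /* one person in a Ferrari: a tiny transfer is dominated by latency */
    printf("64 bytes from RAM: %.0f ns\n",
           transfer_time(ram_latency, ram_bw, 64.0) * 1e9);
    /* 20 people in a mini-bus: a big transfer is dominated by bandwidth */
    printf("1GB from RAM:      %.1f ms\n",
           transfer_time(ram_latency, ram_bw, 1e9) * 1e3);
    return 0;
}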
Because of this trade-off between latency and bandwidth, gamers tend to opt for fewer cores (less bandwidth) with higher clock speeds (better latency), whereas people doing multimedia editing will sacrifice clock speed (worse latency) and go for more cores (more bandwidth).
Finding a balance between latency and bandwidth extends beyond the CPU. The same fight is fought in RAM, in hard drives, and even in the kind of line you get for your internet access.
Latency and bandwidth for CPU and RAM
Now that we understand the difference between latency and bandwidth, below are benchmarks showing both on the e5-2670, which has 20MB of L3 cache.
[Benchmark chart. Approximate figures, also quoted below:]
Level   Bandwidth   Latency
L1      105GB/s     1.2ns
L2      59GB/s      3.7ns
L3      28GB/s      6.6ns
RAM     16GB/s      64.5ns
What this tells us is that every time the data sits one step further away from the CPU, the bandwidth drops to roughly 55% of the previous level's (105GB/s, 59GB/s, 28GB/s, 16GB/s). The latencies, however, increase almost exponentially: 1.2ns, 3.7ns, 6.6ns, 64.5ns. And from that point onwards things slow down seriously, because if the data resides in neither the CPU's cache nor the RAM, it will most likely be on the hard drive.
Latency and bandwidth on other components
Access times on hard drives are no longer measured in nanoseconds, but in milliseconds or microseconds (1 millisecond = 1,000,000 nanoseconds, while 1 microsecond = 1,000 nanoseconds). Fast mechanical hard drives, like the Raptors, have a latency of around 8 milliseconds (8,000 microseconds), but once we look at SSDs we can measure the access time in microseconds (30 to 100). From a bandwidth perspective, accessing data on a Raptor hard drive gives about 140MB/s, compared to around 550MB/s on a fast SSD.
However, if the data is not found on the hard drive, the next place to look is the local network. A local network has access times measured in milliseconds, ranging from about 0.6ms (600 microseconds) to about 0.8ms. The bandwidth of a 1Gb/s network allows you to transfer data to your CPU at around 120MB/s.
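Plugging these numbers into the same latency-plus-bandwidth model from the bandwidth section shows how wide the gaps really are; a sketch, assuming the round figures quoted in this article:

#include <stdio.h>

static double transfer_time(double latency_s, double bandwidth_Bps, double bytes) {
    return latency_s + bytes / bandwidth_Bps;  /* same rough model as before */
}

int main(void) {
    const double size = 1e6;  /* fetch 1MB from each device */

    /* latency and bandwidth figures as quoted in this article */
    printf("fast SSD:    %.2f ms\n", transfer_time(100e-6, 550e6, size) * 1e3);
    printf("Raptor HDD:  %.2f ms\n", transfer_time(8e-3,   140e6, size) * 1e3);
    printf("1Gb network: %.2f ms\n", transfer_time(0.7e-3, 120e6, size) * 1e3);
    return 0;
}

For a 1MB read this works out to roughly 2ms from the SSD, 15ms from the Raptor and 9ms over the network, which is why the hard drive subsystem matters so much once data falls out of RAM.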
Latency and bandwidth
The further away from the CPU the data resides, the slower it becomes to access, in terms of not only latency but also available bandwidth. So, to build the fastest computer possible, it would make a lot of sense to store all the data in the CPU's cache. Why, then, don't CPU manufacturers just build CPUs with a couple of GB of cache? The best reasons I can figure out are the following:
Latencies increase as it becomes more complex to search the cache for the particular piece of data you require. Grow the cache beyond a certain point and latencies start to climb again, defeating the purpose of its original design.
The silicon used to build CPU cache is very expensive, increasing production costs and, in turn, the price consumers pay.
Finally, the physical space on the CPU die is limited, and cache already takes up a big chunk of it. If CPUs had to be physically bigger, they would generate more heat, need bigger sockets, draw more power and require more cooling.
As a result, CPU manufacturers spend a lot of time determining the optimal balance between latency, cost, and the purpose of the CPU. For these reasons it is not unusual to see CPUs with only 1MB or 2MB of L3 cache. The e5-2670, however, has 20MB of on-die L3 cache, considerably more than the average CPU.
People may ask: hold on, how much data can you really store in 20MB of CPU cache, let alone 2MB? To understand that, think about how a program is written. Take a simple loop, "For i = 1 to 1,000,000; Next i". That piece of code is tiny, yet it will occupy the CPU for at least a million cycles. If the CPU had to fetch that code from the slower RAM, or worse the hard drive, each of the million times it executes it, things would become extremely frustrating. A piece of code that small fits comfortably in the L1 cache.
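To make that concrete, here is the same loop written in C: the compiled loop body is only a handful of machine instructions, so after the first pass both the code and the counter stay resident in L1 for all the remaining iterations.

#include <stdio.h>

int main(void) {
    long sum = 0;

    /* the loop body compiles to a few bytes of machine code, far smaller
       than even the smallest L1 code cache in the table above (64KB), so
       the CPU never has to fetch it from RAM again after the first pass */
    for (long i = 1; i <= 1000000; i++)
        sum += i;

    printf("%ld\n", sum);  /* use the result so the compiler keeps the loop */
    return 0;
}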
When building a fast system, you have two things to consider: the time it takes to access a specific piece of data (latency), and how much data you can move at once (bandwidth). If you want a fast system, whether you are a gamer or a data cruncher, it is important to optimise the one you will rely on most often.
For people doing data crunching, it is important to get data from the hard drives into RAM as fast as possible. For video editing, you want a lot of bandwidth from the RAM to the CPU, but also fairly low latencies on the CPU. Gamers process relatively little data but need to process it as fast as possible, which is why they tend to opt for fewer cores at higher clock speeds. Despite new titles taking better advantage of multiple cores, the focus is still on higher clock speeds.
Conclusion:
Hopefully this article has explained that a fast computer for one person is not necessarily a fast computer for the next. Installing a very fast CPU in a data cruncher's computer but pairing it with 5900RPM hard drives will be as frustrating as giving a gamer a RAID 0 array of SSDs behind a Celeron CPU.
Building a fast system is tricky, and it requires careful consideration when choosing the most appropriate components.
In my next article, I will attempt to build a very fast, general-purpose system capable of not only playing the latest games but also doing the data crunching as and when required.