Improving Sequential Read on Oracle Cloud Infrastructure-Classic instances of Ubuntu Linux

When a distribution is compiled and packaged, it is somewhat unknown how it will be used by any particular end user. An optimization or a setting that might benefit one type of I/O load can potentially have a negative impact on another type of load.

The Ubuntu image for Oracle Cloud Infrastructure-Classic is capable of running a wide variety of I/O load types and by default comes preconfigured to handle as broad a mix of I/O loads as possible.

As a result, there are areas where further optimization is possible. One such case is large sequential read operations.

System administrators and application developers are usually the ones who know how a particular operating system instance will be used.

This article aims to help those power users of Ubuntu instances on Oracle Cloud Infrastructure-Classic optimize the sequential read I/O performance of their instances.

1. Function posix_fadvise():

If your system I/O load involves reading contiguous and sequential chunks of large files, you may benefit from making the operating system aware of this.

The Linux system call interface provides an API through which an application can hint to the kernel what type of file access it is going to perform. This can be done with the fadvise64() system call or its wrapper function posix_fadvise():


NAME
      posix_fadvise - predeclare an access pattern for file data
SYNOPSIS
      #include <fcntl.h>
      int posix_fadvise(int fd, off_t offset, off_t len, int advice);
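
For example, an application that is about to scan a large file from start to finish could advise the kernel of that before issuing its reads. Below is a minimal sketch, not taken from the original article; the file name datafile is just a placeholder:

/* Hint that the whole file will be read sequentially, so the kernel
 * can use a larger readahead window for it. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("datafile", O_RDONLY);
    if (fd == -1) {
        perror("open");
        return 1;
    }

    /* offset 0 and len 0 mean "from the start of the file to its end".
     * posix_fadvise() returns an error number instead of setting errno. */
    int ret = posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    if (ret != 0)
        fprintf(stderr, "posix_fadvise: %s\n", strerror(ret));

    /* ... sequential read() calls on fd would follow here ... */

    close(fd);
    return 0;
}

On Linux, POSIX_FADV_SEQUENTIAL typically enlarges the readahead window for that file descriptor, while POSIX_FADV_RANDOM disables readahead for it.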

2. Function readahead():

Or the application can request a certain amount of page cache pages be preloaded with the data from a particular file before it starts issuing read() system calls. This can be achieved by readahead():


NAME
      readahead - initiate file readahead into page cache
SYNOPSIS
      #define _GNU_SOURCE             /* See feature_test_macros(7) */
      #include <fcntl.h>
      ssize_t readahead(int fd, off64_t offset, size_t count);
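
A minimal sketch of that approach follows; the file name and the 4MB preload size are illustrative choices, not taken from the original article:

#define _GNU_SOURCE             /* readahead() is Linux-specific */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("datafile", O_RDONLY);
    if (fd == -1) {
        perror("open");
        return 1;
    }

    /* Ask the kernel to populate the page cache with the first 4MB of
     * the file; readahead() blocks until that data has been read. */
    if (readahead(fd, 0, 4 * 1024 * 1024) == -1)
        perror("readahead");

    /* ... read() calls over this range are now served from the page cache ... */

    close(fd);
    return 0;
}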

3. Command blockdev:

While the system calls above work for application developers, they won't help system administrators who are not developers of the application they are administering. Luckily, there is another way to optimize sequential reads, and that is through the blockdev command:


NAME
      blockdev - call block device ioctls from the command line
SYNOPSIS
      blockdev [-q] [-v] command [command...] device [device...]
      blockdev --report [device...] 

With blockdev, we can query the operating system readahead size using the --getra and --getfra command parameters:


      --getfra
             Get filesystem readahead in 512-byte sectors.
      --getra
             Print readahead (in 512-byte sectors).

On Ubuntu 16.04, the readahead parameters are set to 256 512-byte sectors (128K):


ds@codeminutia:~$ sudo blockdev --getra /dev/xvdb
256
ds@codeminutia:~$ sudo blockdev --getfra /dev/xvdb
256

That means that on every file read() an application does, the Linux kernel will actually read 128K worth of data from storage and store that data in the page cache. This happens even if you called read() requesting only 1K from the file.

The operating system does this in anticipation that the application is going to do another read(1K) system call and this read is going to be much faster because it can be serviced from the buffer cache instead of going all the way down to the storage backend.
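
To make that pattern concrete, here is a minimal sketch, not taken from the original article and with datafile as a placeholder, of the kind of small sequential reads that readahead speeds up:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[1024];
    ssize_t n;

    int fd = open("datafile", O_RDONLY);
    if (fd == -1) {
        perror("open");
        return 1;
    }

    /* Reading 1K at a time: the kernel notices the sequential pattern and
     * reads ahead up to the configured readahead size (128K by default
     * here), so most of these read() calls are served from the page cache. */
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        ;   /* process buf[0..n-1] here */

    if (n == -1)
        perror("read");

    close(fd);
    return 0;
}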

But sometimes applications read sequential data in larger chunks than 128K. The fio test below simulates sequential reads in 8MB blocks:


ds@codeminutia:~$ sudo fio --name=sq-read --rw=read --size=10G --direct=0 --bs=8M --invalidate=1 --filename=/dev/xvdb
sq-read: (g=0): rw=read, bs=(R) 8192KiB-8192KiB, (W) 8192KiB-8192KiB, (T) 8192KiB-8192KiB, ioengine=psync, iodepth=1
fio-2.99
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=440MiB/s,w=0KiB/s][r=55,w=0 IOPS][eta 00m:00s]
sq-read: (groupid=0, jobs=1): err= 0: pid=1199: Tue Nov 28 16:57:24 2017
read: IOPS=53, BW=428MiB/s (449MB/s)(10.0GiB/23927msec)
clat (usec): min=15491, max=42372, avg=18688.55, stdev=1690.70
lat (usec): min=15492, max=42373, avg=18689.20, stdev=1690.70
clat percentiles (usec):
| 1.00th=[16319], 5.00th=[17171], 10.00th=[17695], 20.00th=[17957],
| 30.00th=[18220], 40.00th=[18220], 50.00th=[18220], 60.00th=[18482],
| 70.00th=[18744], 80.00th=[19006], 90.00th=[19792], 95.00th=[21365],
| 99.00th=[27395], 99.50th=[28443], 99.90th=[30016], 99.95th=[42206],
| 99.99th=[42206]
bw ( KiB/s): min=344064, max=458752, per=99.91%, avg=437835.68, stdev=18915.12, samples=47
iops : min= 42, max= 56, avg=53.40, stdev= 2.31, samples=47
lat (msec) : 20=91.02%, 50=8.98%
cpu : usr=0.03%, sys=31.33%, ctx=21920, majf=0, minf=525
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwt: total=1280,0,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
READ: bw=428MiB/s (449MB/s), 428MiB/s-428MiB/s (449MB/s-449MB/s), io=10.0GiB (10.7GB), run=23927-23927msec

Disk stats (read/write):
xvdb: ios=81796/20, merge=0/2, ticks=65004/112, in_queue=65068, util=79.50%

We got 449MB/s throughput with an 8MB sequential read when readahead is set to 128K. If your application does large sequential reads like those, you may want to increase this readahead buffer to something bigger like 4MB:


ds@codeminutia:~$ sudo blockdev --setra 8192 /dev/xvdb
ds@codeminutia:~$ sudo blockdev --setfra 8192 /dev/xvdb
ds@codeminutia:~$ sudo blockdev --getra /dev/xvdb
8192
ds@codeminutia:~$ sudo blockdev --getfra /dev/xvdb
8192

The same 8MB sequential read fio test with readahead set to 4MB now achieves 1,117MB/s of throughput, a large improvement over the previous 449MB/s:


ds@codeminutia:~$ sudo fio --name=sq-read --rw=read --size=10G --direct=0 --bs=8M --invalidate=1 --filename=/dev/xvdb
sq-read: (g=0): rw=read, bs=(R) 8192KiB-8192KiB, (W) 8192KiB-8192KiB, (T) 8192KiB-8192KiB, ioengine=psync, iodepth=1
fio-2.99
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=1024MiB/s,w=0KiB/s][r=128,w=0 IOPS][eta 00m:00s]
sq-read: (groupid=0, jobs=1): err= 0: pid=1225: Tue Nov 28 16:54:06 2017
read: IOPS=133, BW=1065MiB/s (1117MB/s)(10.0GiB/9614msec)
clat (usec): min=1476, max=27932, avg=7506.91, stdev=1225.27
lat (usec): min=1476, max=27932, avg=7507.36, stdev=1225.26
clat percentiles (usec):
| 1.00th=[ 6325], 5.00th=[ 6587], 10.00th=[ 6718], 20.00th=[ 6849],
| 30.00th=[ 6980], 40.00th=[ 7177], 50.00th=[ 7308], 60.00th=[ 7439],
| 70.00th=[ 7635], 80.00th=[ 7898], 90.00th=[ 8356], 95.00th=[ 8979],
| 99.00th=[11863], 99.50th=[13304], 99.90th=[21103], 99.95th=[27919],
| 99.99th=[27919]
bw ( MiB/s): min= 976, max= 1152, per=99.92%, avg=1064.31, stdev=50.08, samples=19
iops : min= 122, max= 144, avg=133.00, stdev= 6.24, samples=19
lat (msec) : 2=0.08%, 10=97.42%, 20=2.34%, 50=0.16%
cpu : usr=0.00%, sys=46.23%, ctx=10541, majf=0, minf=525
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwt: total=1280,0,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
READ: bw=1065MiB/s (1117MB/s), 1065MiB/s-1065MiB/s (1117MB/s-1117MB/s), io=10.0GiB (10.7GB), run=9614-9614msec

Disk stats (read/write):
xvdb: ios=80248/3, merge=0/1, ticks=239780/4, in_queue=240008, util=98.62%

This is great! Why isn’t this set to 4M by default in Ubuntu?!

Very specific applications that do sequential reads can benefit from this setting. For example, a database server with only a handful of large files is a good candidate.

Applications that do some level of random reads, where the next random offset falls within the readahead buffer, would also benefit. However, this is not turned on by default because it has the potential to negatively impact other applications. On each file read an application does, the Linux kernel will read 4M of data and fill the page cache with it.

Now imagine an application that reads from a very large number of files. Each small read of, say, 1K would end up filling 4M of buffer cache. This cascading effect results in read amplification that drains the page cache: the system could end up spending a lot of time filling the page cache with large chunks of data that quickly have to be discarded to make room for more data to be read in.

Also keep in mind that this is a system-wide change, so all applications accessing the same device would behave the same way.

That’s why this setting is best left to the system administrator to tune to the optimal value for their particular application and load. Also, keep in mind that the setting is not persistent across reboots, so you might want to add the command to /etc/rc.local.

Note: Optimization and testing were performed on Oracle Cloud Infrastructure-Classic instances of Ubuntu. Whether these optimizations help in other environments and setups has not been investigated.