Improving Sequential Read on Oracle Cloud Infrastructure-Classic instances of Ubuntu Linux

When a distribution is compiled and packaged, it is largely unknown how any particular end user will use it. An optimization or setting that benefits one type of I/O load can have a negative impact on another.

The Ubuntu Image for Oracle Cloud Infrastructure-Classic is capable of running many I/O load types and by default comes preconfigured to handle as broad an I/O load mix as possible.

As a result, there are areas where further optimization is possible. One such case is large sequential read operations.

System administrators and application developers are usually the ones privy to how a particular operating system instance will be used.

This article aims to help those power users of Ubuntu instances on Oracle Cloud Infrastructure-Classic optimize the sequential read I/O performance of their instances.

1. Function posix_fadvise():

If your system I/O load involves reading contiguous and sequential chunks of large files, you may benefit from making the operating system aware of this.

The Linux system call interface provides an API through which an application can hint to the kernel what type of file access it is going to perform. This can be done with the fadvise64() system call or its wrapper function posix_fadvise():


NAME
      posix_fadvise - predeclare an access pattern for file data
SYNOPSIS
      #include <fcntl.h>
      int posix_fadvise(int fd, off_t offset, off_t len, int advice);
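
For example, here is a minimal sketch of an application declaring a sequential access pattern before reading (the file path is illustrative):

/* Sketch: tell the kernel we will read this file sequentially. */
#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/data/bigfile", O_RDONLY);  /* illustrative path */
        if (fd < 0) {
                perror("open");
                return 1;
        }
        /* offset 0 and len 0 cover the whole file; on Linux the kernel
           responds to POSIX_FADV_SEQUENTIAL by enlarging the readahead
           window for this file descriptor */
        int rc = posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
        if (rc != 0)
                fprintf(stderr, "posix_fadvise: error %d\n", rc);

        /* ... sequential read() loop would follow here ... */
        close(fd);
        return 0;
}

Note that posix_fadvise() returns the error number directly rather than setting errno.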

2. Function readahead():

Alternatively, an application can request that a certain number of page cache pages be preloaded with data from a particular file before it starts issuing read() system calls. This can be achieved with readahead():


NAME
      readahead - initiate file readahead into page cache
SYNOPSIS
      #define _GNU_SOURCE             /* See feature_test_macros(7) */
      #include <fcntl.h>
      ssize_t readahead(int fd, off64_t offset, size_t count);
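
A similar sketch using readahead() to warm the page cache before reading (again, the path and size are illustrative):

/* Sketch: prefill the page cache with the first 8 MiB of a file. */
#define _GNU_SOURCE             /* readahead() is Linux-specific */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/data/bigfile", O_RDONLY);  /* illustrative path */
        if (fd < 0) {
                perror("open");
                return 1;
        }
        /* blocks until the requested range is in the page cache;
           subsequent read() calls over that range are served from memory */
        if (readahead(fd, 0, 8 * 1024 * 1024) < 0)
                perror("readahead");

        /* ... read() loop would follow here ... */
        close(fd);
        return 0;
}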

3. Command blockdev:

While the system calls above work for application developers, they won’t help system administrators who are not the developers of the application they administer. Luckily, there is another way to optimize sequential reads: the blockdev command:


NAME
      blockdev - call block device ioctls from the command line
SYNOPSIS
      blockdev [-q] [-v] command [command...] device [device...]
      blockdev --report [device...] 

With blockdev, we can query the operating system readahead size using the --getra and --getfra command parameters:


      --getfra
             Get filesystem readahead in 512-byte sectors.
      --getra
             Print readahead (in 512-byte sectors).

On Ubuntu 16.04, the readahead parameters are set to 256 512-byte sectors (128K):


ds@codeminutia:~$ sudo blockdev --getra /dev/xvdb
256
ds@codeminutia:~$ sudo blockdev --getfra /dev/xvdb
256

That means that on every file read() an application does, the Linux kernel will actually read 128K worth of data from storage and store that data in the page cache. This happens even if you called read() requesting only 1K from the file.

The operating system does this in anticipation that the application is going to issue another read(1K) call, and that read is going to be much faster because it can be serviced from the page cache instead of going all the way down to the storage backend.
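
A crude way to see the cache effect (an illustration, not a benchmark) is to read the same 128K twice, first with a cold page cache and then with a warm one; dd prints the throughput of each pass:

ds@codeminutia:~$ sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
ds@codeminutia:~$ sudo dd if=/dev/xvdb of=/dev/null bs=1K count=128
ds@codeminutia:~$ sudo dd if=/dev/xvdb of=/dev/null bs=1K count=128

The first pass goes down to storage; the second should be served from the page cache and complete noticeably faster.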

But sometimes applications read sequential data in larger chunks than 128K. The fio test below simulates sequential reads in 8MB blocks:


ds@codeminutia:~$ sudo fio --name=sq-read --rw=read --size=10G --direct=0 --bs=8M --invalidate=1 --filename=/dev/xvdb
sq-read: (g=0): rw=read, bs=(R) 8192KiB-8192KiB, (W) 8192KiB-8192KiB, (T) 8192KiB-8192KiB, ioengine=psync, iodepth=1
fio-2.99
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=440MiB/s,w=0KiB/s][r=55,w=0 IOPS][eta 00m:00s]
sq-read: (groupid=0, jobs=1): err= 0: pid=1199: Tue Nov 28 16:57:24 2017
read: IOPS=53, BW=428MiB/s (449MB/s)(10.0GiB/23927msec)
clat (usec): min=15491, max=42372, avg=18688.55, stdev=1690.70
lat (usec): min=15492, max=42373, avg=18689.20, stdev=1690.70
clat percentiles (usec):
| 1.00th=[16319], 5.00th=[17171], 10.00th=[17695], 20.00th=[17957],
| 30.00th=[18220], 40.00th=[18220], 50.00th=[18220], 60.00th=[18482],
| 70.00th=[18744], 80.00th=[19006], 90.00th=[19792], 95.00th=[21365],
| 99.00th=[27395], 99.50th=[28443], 99.90th=[30016], 99.95th=[42206],
| 99.99th=[42206]
bw ( KiB/s): min=344064, max=458752, per=99.91%, avg=437835.68, stdev=18915.12, samples=47
iops : min= 42, max= 56, avg=53.40, stdev= 2.31, samples=47
lat (msec) : 20=91.02%, 50=8.98%
cpu : usr=0.03%, sys=31.33%, ctx=21920, majf=0, minf=525
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwt: total=1280,0,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
READ: bw=428MiB/s (449MB/s), 428MiB/s-428MiB/s (449MB/s-449MB/s), io=10.0GiB (10.7GB), run=23927-23927msec

Disk stats (read/write):
xvdb: ios=81796/20, merge=0/2, ticks=65004/112, in_queue=65068, util=79.50%

We got 449MB/s of throughput from an 8MB sequential read with readahead set to 128K. If your application does large sequential reads like these, you may want to increase the readahead buffer to something bigger, such as 4MB (8192 512-byte sectors):


ds@codeminutia:~$ sudo blockdev --setra 8192 /dev/xvdb
ds@codeminutia:~$ sudo blockdev --setfra 8192 /dev/xvdb
ds@codeminutia:~$ sudo blockdev --getra /dev/xvdb
8192
ds@codeminutia:~$ sudo blockdev --getfra /dev/xvdb
8192

The same 8MB sequential read fio test with readahead set to 4MB now achieves 1,117MB/s of throughput, a large improvement over the previous 449MB/s.


ds@codeminutia:~$ sudo fio --name=sq-read --rw=read --size=10G --direct=0 --bs=8M --invalidate=1 --filename=/dev/xvdb
sq-read: (g=0): rw=read, bs=(R) 8192KiB-8192KiB, (W) 8192KiB-8192KiB, (T) 8192KiB-8192KiB, ioengine=psync, iodepth=1
fio-2.99
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=1024MiB/s,w=0KiB/s][r=128,w=0 IOPS][eta 00m:00s]
sq-read: (groupid=0, jobs=1): err= 0: pid=1225: Tue Nov 28 16:54:06 2017
read: IOPS=133, BW=1065MiB/s (1117MB/s)(10.0GiB/9614msec)
clat (usec): min=1476, max=27932, avg=7506.91, stdev=1225.27
lat (usec): min=1476, max=27932, avg=7507.36, stdev=1225.26
clat percentiles (usec):
| 1.00th=[ 6325], 5.00th=[ 6587], 10.00th=[ 6718], 20.00th=[ 6849],
| 30.00th=[ 6980], 40.00th=[ 7177], 50.00th=[ 7308], 60.00th=[ 7439],
| 70.00th=[ 7635], 80.00th=[ 7898], 90.00th=[ 8356], 95.00th=[ 8979],
| 99.00th=[11863], 99.50th=[13304], 99.90th=[21103], 99.95th=[27919],
| 99.99th=[27919]
bw ( MiB/s): min= 976, max= 1152, per=99.92%, avg=1064.31, stdev=50.08, samples=19
iops : min= 122, max= 144, avg=133.00, stdev= 6.24, samples=19
lat (msec) : 2=0.08%, 10=97.42%, 20=2.34%, 50=0.16%
cpu : usr=0.00%, sys=46.23%, ctx=10541, majf=0, minf=525
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwt: total=1280,0,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
READ: bw=1065MiB/s (1117MB/s), 1065MiB/s-1065MiB/s (1117MB/s-1117MB/s), io=10.0GiB (10.7GB), run=9614-9614msec

Disk stats (read/write):
xvdb: ios=80248/3, merge=0/1, ticks=239780/4, in_queue=240008, util=98.62%

This is great! Why isn’t this set to 4M by default in Ubuntu?!

Very specific applications that do sequential reads can benefit from this setting. For example, a database server with only a handful of large files is a good candidate.

Applications that do some level of random reads, where the next random offset falls within the readahead buffer, would also benefit. However, the reason this is not turned on by default is that it has the potential to negatively impact other applications: on each file read an application performs, the Linux kernel will read up to 4M of data and fill the page cache with it.

Now imagine an application that reads from a very large number of files. Each small read of, say, 1K would end up filling 4M of the page cache. This cascading effect results in read amplification that churns the page cache: the system could spend much of its time filling the page cache with large chunks of data that must be discarded quickly to make room for more data being read in.

Also keep in mind that this is a system-wide change, so all applications accessing the same device would behave the same way.

That’s why it is best left to the system administrator to tune this setting to the optimal value for their particular application and load. Also keep in mind that the setting does not persist across reboots, so you might want to add the command to /etc/rc.local, as sketched below.
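
A minimal /etc/rc.local along these lines would reapply the values at boot (the device name is an example, and the file must be executable):

#!/bin/sh
# Reapply the larger readahead window after a reboot.
# /dev/xvdb is an example device; substitute your own.
blockdev --setra 8192 /dev/xvdb
blockdev --setfra 8192 /dev/xvdb
exit 0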

Note: Optimization and testing were performed on Oracle Cloud Infrastructure-Classic instances of Ubuntu. Whether these optimizations help in other environments and setups has not been investigated.

Porting Fio to OpenWRT

OpenWRT is a favorite flavor of Linux for embedded devices such as firewalls and routers. Increasingly, devices based on various ARM SoCs are appearing with integrated SATA controllers, which in turn pushes such devices toward use cases like file servers, backup targets, and media servers.

If you’re going to use an ARM SoC as a file server, it would be nice to know the performance of your storage backend. There is a good storage performance testing program for Linux called Fio. Unfortunately, fio is not part of OpenWRT, but luckily porting an application to OpenWRT can be quite simple.

1. Clone OpenWRT git repository


ds@codeminutia:~/src$ git clone https://git.openwrt.org/openwrt/openwrt.git

2. Decide on a good place to add the application; package/utils seems like a reasonable spot:


ds@codeminutia:~/src/openwrt$ mkdir package/utils/fio

3. Create a Makefile under package/utils/fio:


ds@codeminutia:~/src/openwrt$ vi package/utils/fio/Makefile

4. The Makefile can be copied from another OpenWRT package, but you will need to adjust some variables, starting with the package name and versions: PKG_VERSION is the fio version, and PKG_RELEASE is the OpenWRT package release number.

PKG_NAME:=fio
PKG_VERSION:=3.6
PKG_RELEASE:=1

5. The variables below tell the build system where to get the software. We use git to fetch the exact revision corresponding to fio 3.6. PKG_MIRROR_HASH is the hash of the source archive created when the code is pulled from git and compressed into fio-3.6.tar.xz.

PKG_SOURCE_PROTO:=git
PKG_SOURCE_URL:=https://github.com/axboe/fio.git
PKG_SOURCE_VERSION:=c5477c6a3b3e0042b1f74414071429ca66d94c2f
PKG_SOURCE_DATE:=2018-04-16
PKG_SOURCE_SUBDIR:=$(PKG_NAME)-$(PKG_VERSION)
PKG_MIRROR_HASH:=""

6. The tricky part is the PKG_MIRROR_HASH variable: you can put an empty string there for now and run make. This downloads the source from git and creates fio-3.6.tar.xz:


ds@codeminutia:~/src/openwrt$ make package/utils/fio/download V=s

7. Now we need to fix up the hash, so we run:


ds@codeminutia:~/src/openwrt$ make package/utils/fio/check V=s FIXUP=1

8. That updates the PKG_MIRROR_HASH variable in your Makefile:

PKG_SOURCE_PROTO:=git
PKG_SOURCE_URL:=https://github.com/axboe/fio.git
PKG_SOURCE_VERSION:=c5477c6a3b3e0042b1f74414071429ca66d94c2f
PKG_SOURCE_DATE:=2018-04-16
PKG_SOURCE_SUBDIR:=$(PKG_NAME)-$(PKG_VERSION)
PKG_MIRROR_HASH:=eecd190a100ccf803575a7f0027c111fb3ff322b05b98b83f01ad88039b33741

9. The section below tells the build system where to place the menu option when you use “make menuconfig”. We are placing fio under the “Utilities” menu.

define Package/fio
  SECTION:=utils
  CATEGORY:=Utilities
  TITLE:=Flexible I/O Tester
  URL:=https://github.com/axboe/fio/
  DEPENDS:= +zlib
endef

10. This is not needed in all cases, but here we make sure the fio configure script gets the right parameters by overriding CONFIGURE_ARGS. For example, a CONFIG_CPU_TYPE of “cortex-a9+neon” yields --cpu=cortex-a9:

# Remove quotes from CONFIG_CPU_TYPE
CONFIG_CPU_TYPE_NQ:= $(patsubst "%",%,$(CONFIG_CPU_TYPE))
# get CPU type so we can pass it to fio configure script
CONF_CPU:=$(firstword $(subst +, ,$(CONFIG_CPU_TYPE_NQ)))

CONFIGURE_ARGS = --prefix="$(PKG_INSTALL_DIR)" --cpu="$(CONF_CPU)" --cc="$(TARGET_CC)" --extra-cflags="$(TARGET_CFLAGS)"

You can see the full Makefile below:

#
# Copyright (C) 2018 - <ds@codeminutia.com>
#
# This is free software, licensed under the GNU General Public License v2.
# See /LICENSE for more information.
#

include $(TOPDIR)/rules.mk

PKG_NAME:=fio
PKG_VERSION:=3.6
PKG_RELEASE:=1

PKG_SOURCE_PROTO:=git
PKG_SOURCE_URL:=https://github.com/axboe/fio.git
PKG_SOURCE_VERSION:=c5477c6a3b3e0042b1f74414071429ca66d94c2f
PKG_SOURCE_DATE:=2018-04-16
PKG_SOURCE_SUBDIR:=$(PKG_NAME)-$(PKG_VERSION)
PKG_MIRROR_HASH:=eecd190a100ccf803575a7f0027c111fb3ff322b05b98b83f01ad88039b33741

PKG_MAINTAINER:=<ds@codeminutia.com>
PKG_LICENSE:=GPL-2.0
PKG_LICENSE_FILES:=COPYING

PKG_BUILD_DEPENDS:= +zlib

HOST_BUILD_DIR:=$(BUILD_DIR_HOST)/$(PKG_NAME)-$(PKG_VERSION)
PKG_BUILD_DIR:=$(BUILD_DIR)/$(PKG_NAME)-$(PKG_VERSION)

include $(INCLUDE_DIR)/host-build.mk
include $(INCLUDE_DIR)/package.mk

define Package/fio
  SECTION:=utils
  CATEGORY:=Utilities
  TITLE:=Flexible I/O Tester
  URL:=https://github.com/axboe/fio/
  DEPENDS:= +zlib
endef

define Package/fio/description
 Flexible I/O Tester

 Fio was originally written to save me the hassle of writing special test case
 programs when I wanted to test a specific workload, either for performance
 reasons or to find/reproduce a bug. The process of writing such a test app can
 be tiresome, especially if you have to do it often.  Hence I needed a tool that
 would be able to simulate a given I/O workload without resorting to writing a
 tailored test case again and again.

 A test work load is difficult to define, though. There can be any number of
 processes or threads involved, and they can each be using their own way of
 generating I/O. You could have someone dirtying large amounts of memory in an
 memory mapped file, or maybe several threads issuing reads using asynchronous
 I/O. fio needed to be flexible enough to simulate both of these cases, and many
 more.

 Fio spawns a number of threads or processes doing a particular type of I/O
 action as specified by the user. fio takes a number of global parameters, each
 inherited by the thread unless otherwise parameters given to them overriding
 that setting is given.  The typical use of fio is to write a job file matching
 the I/O load one wants to simulate.

 Author: Jens Axboe <axboe@kernel.dk>
 Package Maintainer: <ds@codeminutia.com> http://codeminutia.com/

endef

# Remove quotes from CONFIG_CPU_TYPE
CONFIG_CPU_TYPE_NQ:= $(patsubst "%",%,$(CONFIG_CPU_TYPE))
# get CPU type so we can pass it to fio configure script
CONF_CPU:=$(firstword $(subst +, ,$(CONFIG_CPU_TYPE_NQ)))

CONFIGURE_ARGS = --prefix="$(PKG_INSTALL_DIR)" --cpu="$(CONF_CPU)" --cc="$(TARGET_CC)" --extra-cflags="$(TARGET_CFLAGS)"

define Build/Compile
        $(MAKE) -C $(PKG_BUILD_DIR)
        $(MAKE) -C $(PKG_BUILD_DIR) INSTALL_PREFIX="$(PKG_INSTALL_DIR)" install
endef

define Package/fio/install
        $(INSTALL_DIR) $(1)/usr/bin
        $(INSTALL_BIN) $(PKG_BUILD_DIR)/$(PKG_NAME) $(1)/usr/bin/
endef

$(eval $(call BuildPackage,fio))
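
With the Makefile in place, the rest should follow the standard OpenWRT workflow: enable the package under Utilities in menuconfig and build it. Assuming the usual build targets, that looks something like this:

ds@codeminutia:~/src/openwrt$ make menuconfig
ds@codeminutia:~/src/openwrt$ make package/utils/fio/compile V=s

The resulting .ipk package should appear under bin/ and can then be installed on the target device with opkg.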