iSCSI SAN technology works in much the same way in general terms, regardless of manufacturer, model or capacity. A “best practice” is no more than a set of recommendations and configurations to bear in mind when implementing an iSCSI SAN. Before getting to them, though, the way the array itself works gives us the first important thing to consider.

Storage arrays work in two main ways (we won’t go deep into this topic in this article):

  • Active – Passive controllers:
    • The storage array can handle all requested tasks with a single controller, so the second controller stays in standby mode and takes full or partial control of those tasks (depending on the array technology) when a failure or a controlled failover occurs.
  • Active – Active controllers:
    • In most cases this won’t work the way you might think. The controllers balance the load by assigning each LUN to one controller in turn, so in a 20 LUN environment you end up with 10 LUNs assigned to controller 0 and the other half to controller 1 (theoretically). If one controller fails, the remaining one manages the entire environment.
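
The LUN distribution described above can be sketched in a few lines of Python. This is only an illustration of the idea, assuming a simple alternating assignment; the function names (assign_luns, fail_over, count_by_controller) are hypothetical, and real arrays decide ownership with their own vendor-specific logic.

```python
from collections import defaultdict


def assign_luns(lun_ids, controllers=(0, 1)):
    """Alternate LUN ownership across the controllers, in order (assumed policy)."""
    owners = {}
    for index, lun in enumerate(lun_ids):
        owners[lun] = controllers[index % len(controllers)]
    return owners


def fail_over(owners, failed, survivor):
    """Move every LUN owned by the failed controller to the surviving one."""
    return {
        lun: (survivor if owner == failed else owner)
        for lun, owner in owners.items()
    }


def count_by_controller(owners):
    """Count how many LUNs each controller currently owns."""
    counts = defaultdict(int)
    for owner in owners.values():
        counts[owner] += 1
    return dict(counts)


owners = assign_luns(range(20))
print(count_by_controller(owners))   # {0: 10, 1: 10}

after_failure = fail_over(owners, failed=1, survivor=0)
print(count_by_controller(after_failure))   # {0: 20}
```

The point of the sketch is that "active-active" here means each LUN still has a single owner at any given time; the balance comes from spreading ownership, not from both controllers serving the same LUN.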

With that explanation in place, we can move on to the main topic, iSCSI SAN best practices:

The following information is not intended to be the only or the best guide on this topic, but rather one of the simplest explanations you will find out there.

  • Array
    • RAID level: RAID technologies and parity algorithms need a completely separate article to be explained in detail, and nowadays most storage arrays take care of this automatically; nevertheless, it’s always important to consider the type of data you’ll be storing there.
    • Power source: it’s recommended to have at least two different power sources (PDUs) available to each storage array, so that there is an alternative path to ensure business continuity.
  • Networking
    • Physically isolate iSCSI traffic from any other type of network traffic. In other words, use dedicated Ethernet switches for the SAN traffic.
    • Use at least two physical switches and, in most cases, connect them with a port channel rather than stacking them (see the manufacturer’s recommendations). A stack makes two or more switches act as one logical unit, which can cause a lot of difficulties at the time of a failure.
    • If you are implementing an active-active controller storage array, use a different subnet for each set of ports. For instance: 192.168.10.10 for port 0 on controller 0 and 192.168.10.11 for port 0 on controller 1; 192.168.11.10 for port 1 on controller 0 and 192.168.11.11 for port 1 on controller 1, and so on.
    • Distribute the subnets between the available switches: subnet 10 to Switch 1 and subnet 11 to Switch 2 (if you follow the previous example).
    • Active – Passive arrays work a bit differently here. Most of them use a single subnet for all their ports, since the standby controller takes over the entire array the same way as the original one. This includes vertical port failover, management ports, etc.
    • Jumbo Frames should be enabled on the NICs, switches and array ports to reduce overhead and improve consistency. Refer to the switch specs to confirm the maximum transmission unit for the payload (9,000 bytes in most cases).
    • Disable Unicast Storm Control on the switch side to avoid discarded packets. Note that multicast and broadcast storm control is encouraged.
    • Enable Flow Control. This is very important to avoid retransmissions and/or an I/O performance impact. Many initiators sending data simultaneously may exceed the throughput capacity of the target (storage array) ports, in which case the receiver (array) will drop packets. Enable Flow Control not only on the switch ports but also on the NICs themselves.
  • Host configuration (not all of the following points apply to all operating systems)
    • Delayed ACK. This is a TCP/IP mechanism that reduces overhead by combining several acknowledgment responses into a single response (ACK).
    • Storage I/O Control (VMware). Enabling this option ensures that the other VMDKs are not affected by the excessive storage I/O demands of a specific VMDK residing on the same datastore.
    • Depending on your needs, make sure to select the right Multipath I/O policy.
    • Login Timeout. This is the time the host waits for a response from the storage array. Changing the value from the default 5 seconds to 60 seconds (recommended) keeps the host from giving up too soon when trying to connect, which results in better behavior during a failure or failover.
    • Noop Timeout should be set to 30 seconds (default = 10 seconds) to allow more time to establish non-active paths to the volume; after this time the path is marked as dead.
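
As an illustration of the active-active addressing scheme from the Networking points, here is a minimal Python sketch that generates a port-to-subnet-to-switch plan. It relies on assumptions that are not in the recommendations themselves: a /24 prefix for each subnet, host addresses starting at .10, and one switch per subnet. The build_plan helper and its parameters are hypothetical names used only for this example.

```python
import ipaddress


def build_plan(subnets, controllers=2, first_host=10):
    """Return (controller, port, address, switch) tuples.

    Port N of every controller is placed in subnet N, and subnet N is
    carried by switch N + 1 (assumption: one switch per subnet).
    """
    plan = []
    for port, cidr in enumerate(subnets):
        network = ipaddress.ip_network(cidr)
        for controller in range(controllers):
            # Host part: .10 for controller 0, .11 for controller 1, ...
            address = network.network_address + first_host + controller
            if address not in network:
                raise ValueError(f"{address} does not fit in {network}")
            plan.append((controller, port, str(address), f"Switch {port + 1}"))
    return plan


# Assumed /24 masks; the original example only gives host addresses.
plan = build_plan(["192.168.10.0/24", "192.168.11.0/24"])
for controller, port, address, switch in plan:
    print(f"controller {controller} port {port}: {address} on {switch}")
```

Writing the plan down this way before touching the array makes it easier to check that each subnet ends up on exactly one switch and that no controller has two ports in the same subnet.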

As mentioned before, this is not a complete guide, but it gives you the key points and a quick explanation of each of them.