| Rocker Network Switch Register Programming Guide |
| Copyright (c) Scott Feldman <sfeldma@gmail.com> |
| Copyright (c) Neil Horman <nhorman@tuxdriver.com> |
| Version 0.11, 12/29/2014 |
| |
| LICENSE |
| ======= |
| |
| This program is free software; you can redistribute it and/or modify |
| it under the terms of the GNU General Public License as published by |
| the Free Software Foundation; either version 2 of the License, or |
| (at your option) any later version. |
| |
| This program is distributed in the hope that it will be useful, |
| but WITHOUT ANY WARRANTY; without even the implied warranty of |
| MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the |
| GNU General Public License for more details. |
| |
| SECTION 1: Introduction |
| ======================= |
| |
| Overview |
| -------- |
| |
| This document describes the hardware/software interface for the Rocker switch |
| device. The intended audience is authors of OS drivers and device emulation |
| software. |
| |
| Notations and Conventions |
| ------------------------- |
| |
| o In register descriptions, [n:m] indicates a range from bit n to bit m, |
| inclusive. |
| o Use of leading 0x indicates a hexadecimal number. |
| o Use of leading 0b indicates a binary number. |
| o The use of RSVD or Reserved indicates that a bit or field is reserved for |
| future use. |
| o Field width is in bytes, unless otherwise noted. |
| o Registers are (R) read-only, (R/W) read/write, (W) write-only, or (COR) clear |
| on read. |
| o TLV values in network-byte-order are designated with (N). |
| |
| |
| SECTION 2: PCI Configuration Registers |
| ====================================== |
| |
| PCI Configuration Space |
| ----------------------- |
| |
| Each switch instance registers as a PCI device with PCI configuration space: |
| |
| offset width description value |
| --------------------------------------------- |
| 0x0 2 Vendor ID 0x1b36 |
| 0x2 2 Device ID 0x0006 |
| 0x4 4 Command/Status |
| 0x8 1 Revision ID 0x01 |
| 0x9 3 Class code 0x2800 |
| 0xC 1 Cache line size |
| 0xD 1 Latency timer |
| 0xE 1 Header type |
| 0xF 1 Built-in self test |
| 0x10 4 Base address low |
| 0x14 4 Base address high |
| 0x18-28 Reserved |
| 0x2C 2 Subsystem vendor ID * |
| 0x2E 2 Subsystem ID * |
| 0x30-38 Reserved |
| 0x3C 1 Interrupt line |
| 0x3D 1 Interrupt pin 0x00 |
| 0x3E 1 Min grant 0x00 |
| 0x3F 1 Max latency 0x00 |
| 0x40 1 TRDY timeout |
| 0x41 1 Retry count |
| 0x42 2 Reserved |
| |
| |
| * Assigned by sub-system implementation |
| |
| SECTION 3: Memory-Mapped Register Space |
| ======================================= |
| |
| There are two memory-mapped BARs. BAR0 maps device register space and is |
| 0x2000 in size. BAR1 maps MSI-X vector and PBA tables and is also 0x2000 in |
| size, allowing for 256 MSI-X vectors. |
| |
| All registers are 4 or 8 bytes long. It is assumed host software will access 4 |
| byte registers with one 4-byte access, and 8 byte registers with either two |
| 4-byte accesses or a single 8-byte access. In the case of two 4-byte accesses, |
| access must be lower and then upper 4-bytes, in that order. |
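
The access rule above can be sketched as follows. This is only an
illustration: fake_bar0 and the reg_* helpers are hypothetical names, and the
plain array stands in for a mapped BAR0 window that a real driver would reach
through its OS's MMIO accessors.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the 0x2000-byte BAR0 register window. */
static uint32_t fake_bar0[0x2000 / 4];

static inline void reg_write32(uint32_t off, uint32_t val)
{
    assert(off % 4 == 0 && off < sizeof(fake_bar0));
    fake_bar0[off / 4] = val;
}

static inline uint32_t reg_read32(uint32_t off)
{
    assert(off % 4 == 0 && off < sizeof(fake_bar0));
    return fake_bar0[off / 4];
}

/* 8-byte register written as two 4-byte accesses: lower half, then upper. */
static inline void reg_write64(uint32_t off, uint64_t val)
{
    reg_write32(off, (uint32_t)(val & 0xffffffffu));
    reg_write32(off + 4, (uint32_t)(val >> 32));
}

/* 8-byte register read as two 4-byte accesses, in the same order. */
static inline uint64_t reg_read64(uint32_t off)
{
    uint64_t lo = reg_read32(off);
    uint64_t hi = reg_read32(off + 4);
    return lo | (hi << 32);
}
```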
| |
| BAR0 device register space is organized as follows: |
| |
| offset description |
| ------------------------------------------------------ |
| 0x0000-0x000f Bogus registers to catch misbehaving |
| drivers. Writes do nothing. Reads |
| back as 0xDEADBABE. |
| 0x0010-0x00ff Test registers |
| 0x0300-0x03ff General purpose registers |
| 0x1000-0x1fff Descriptor control |
| |
| Holes in register space are reserved. Writes to reserved registers do nothing. |
| Reads to reserved registers read back as 0. |
| |
| No fancy stuff like write-combining is enabled on any of the registers. |
| |
| BAR1 MSI-X register space is organized as follows: |
| |
| offset description |
| ------------------------------------------------------ |
| 0x0000-0x0fff MSI-X vector table (256 vectors total) |
| 0x1000-0x1fff MSI-X PBA table |
| |
| |
| SECTION 4: Interrupts, DMA, and Endianness |
| ========================================== |
| |
| PCI Interrupts |
| -------------- |
| |
| The device supports only MSI-X interrupts. BAR1 memory-mapped region contains |
| the MSI-X vector and PBA tables, with support for up to 256 MSI-X vectors. |
| |
| The vector assignment is: |
| |
| vector description |
| ----------------------------------------------------- |
| 0 Command descriptor ring completion |
| 1 Event descriptor ring completion |
| 2 Test operation completion |
| 3 RSVD |
| 4-255 Tx and Rx descriptor ring completion |
| Tx vector is even |
| Rx vector is odd |
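
One possible port-to-vector mapping that satisfies the table above is sketched
below. The exact formula is an assumption (the table only requires Tx vectors
to be even and Rx vectors to be odd within 4-255); port is the zero-based port
index used for the descriptor rings, and the names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum {
    MSIX_VEC_CMD       = 0,  /* command ring completion */
    MSIX_VEC_EVENT     = 1,  /* event ring completion */
    MSIX_VEC_TEST      = 2,  /* test operation completion */
    MSIX_VEC_PORT_BASE = 4,  /* first Tx/Rx vector */
};

/* Assumed mapping: each port owns one even (Tx) and one odd (Rx) vector. */
static inline uint32_t msix_vec_tx(uint32_t port)
{
    assert(port < 62);
    return MSIX_VEC_PORT_BASE + 2 * port;
}

static inline uint32_t msix_vec_rx(uint32_t port)
{
    assert(port < 62);
    return MSIX_VEC_PORT_BASE + 2 * port + 1;
}
```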
| |
| An MSI-X vector table entry is 16 bytes: |
| |
| field offset width description |
| ------------------------------------------------------------- |
| lower_addr 0x0 4 [31:2] message address[31:2] |
| [1:0] Rsvd (4 byte alignment |
| required) |
| upper_addr 0x4 4 [31:15] Rsvd |
| [14:0] message address[46:32] |
| data 0x8 4 message data[31:0] |
| control 0xc 4 [31:1] Rsvd |
| [0] mask (0 = enable, |
| 1 = masked) |
| |
| Software should install the Interrupt Service Routine (ISR) before any ports |
| are enabled or any commands are issued on the command ring. |
| |
| DMA Operations |
| -------------- |
| |
| DMA operations are used for packet DMA to/from the CPU, command and event |
| processing. Command processing includes statistical counters and table dumps, |
| table insertion/deletion, and more. Event processing provides an async |
| notification method for device-originating events. Each DMA operation has a |
| set of control registers to manage a descriptor ring. The descriptor rings are |
| allocated from contiguous host DMA-able memory and registers specify the ring's |
| base address, size and current head and tail indices. Software always writes |
| the head, and hardware always writes the tail. |
| |
| The high-order bit of DMA_DESC_COMP_ERR is used to mark hardware completion |
| of a descriptor. Software will clear this bit when posting a descriptor to the |
| ring, and hardware will set this bit when the descriptor is complete. |
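
A minimal sketch of how software might test the completion bit follows.
DMA_DESC_COMP_ERR is 2 bytes wide, so the high-order bit is 0x8000. The helper
names are hypothetical, and treating the remaining 15 bits as the status value
is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DESC_COMP_ERR_DONE 0x8000u  /* high-order bit of the 16-bit field */

/* Value software writes when posting a descriptor: completion bit clear. */
static inline uint16_t comp_err_prepare(void)
{
    return 0;
}

/* True once hardware has set the completion bit. */
static inline bool comp_err_done(uint16_t comp_err)
{
    return (comp_err & DESC_COMP_ERR_DONE) != 0;
}

/* Assumed: the status lives in the low 15 bits of a completed descriptor. */
static inline uint16_t comp_err_status(uint16_t comp_err)
{
    assert(comp_err_done(comp_err));
    return (uint16_t)(comp_err & 0x7fffu);
}
```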
| |
| Descriptor ring sizes must be a power of 2 and range from 2 to 64K entries. |
| Descriptor rings' base address must be 8-byte aligned. Descriptors must be |
| packed within ring. Each descriptor in each ring must also be aligned on an 8 |
| byte boundary. Each descriptor ring will have these registers: |
| |
| DMA_DESC_xxx_BASE_ADDR, offset 0x1000 + (x * 32), 64-bit, (R/W) |
| DMA_DESC_xxx_SIZE, offset 0x1008 + (x * 32), 32-bit, (R/W) |
| DMA_DESC_xxx_HEAD, offset 0x100c + (x * 32), 32-bit, (R/W) |
| DMA_DESC_xxx_TAIL, offset 0x1010 + (x * 32), 32-bit, (R) |
| DMA_DESC_xxx_CTRL, offset 0x1014 + (x * 32), 32-bit, (W) |
| DMA_DESC_xxx_CREDITS, offset 0x1018 + (x * 32), 32-bit, (R/W) |
| DMA_DESC_xxx_RSVD1, offset 0x101c + (x * 32), 32-bit, (R/W) |
| |
| Where x is descriptor ring index: |
| |
| index ring |
| -------------------- |
| 0 CMD |
| 1 EVENT |
| 2 TX (port 0) |
| 3 RX (port 0) |
| 4 TX (port 1) |
| 5 RX (port 1) |
| . |
| . |
| . |
| 124 TX (port 61) |
| 125 RX (port 61) |
| 126 Resv |
| 127 Resv |
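
The register offsets and ring indices above reduce to simple arithmetic. The
following sketch uses hypothetical helper names; port is the zero-based port
index from the ring table.

```c
#include <assert.h>
#include <stdint.h>

/* Offsets of the per-ring registers relative to the ring's 32-byte block. */
enum ring_reg {
    RING_BASE_ADDR = 0x00,
    RING_SIZE      = 0x08,
    RING_HEAD      = 0x0c,
    RING_TAIL      = 0x10,
    RING_CTRL      = 0x14,
    RING_CREDITS   = 0x18,
};

/* Ring index of the Tx ring for a port (ports 0-61). */
static inline uint32_t ring_index_tx(uint32_t port)
{
    assert(port <= 61);
    return 2 + 2 * port;
}

/* Ring index of the Rx ring for a port (ports 0-61). */
static inline uint32_t ring_index_rx(uint32_t port)
{
    assert(port <= 61);
    return 3 + 2 * port;
}

/* BAR0 offset of a register for descriptor ring x: 0x1000 + (x * 32) + reg. */
static inline uint32_t ring_reg_offset(uint32_t ring, enum ring_reg reg)
{
    assert(ring < 128);
    return 0x1000 + ring * 32 + (uint32_t)reg;
}
```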
| |
| Writing BASE_ADDR or SIZE will reset HEAD and TAIL to zero. HEAD cannot be |
| written past TAIL. To do so would wrap the ring. An empty ring is when HEAD |
| == TAIL. A full ring is when HEAD is one position behind TAIL. Both HEAD and |
| TAIL increment and modulo wrap at the ring size. |
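
The empty and full conditions can be expressed as small index helpers. This is
a sketch with hypothetical names; it relies on the ring size being a power of
2, so that the modulo wrap is a simple mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Ring sizes must be a power of 2 between 2 and 64K entries. */
static inline bool ring_size_valid(uint32_t size)
{
    return size >= 2 && size <= 65536 && (size & (size - 1)) == 0;
}

/* Advance an index by one position, wrapping at the ring size. */
static inline uint32_t ring_next(uint32_t idx, uint32_t size)
{
    assert(ring_size_valid(size));
    return (idx + 1) & (size - 1);
}

/* Empty: HEAD == TAIL. */
static inline bool ring_empty(uint32_t head, uint32_t tail)
{
    return head == tail;
}

/* Full: HEAD is one position behind TAIL. */
static inline bool ring_full(uint32_t head, uint32_t tail, uint32_t size)
{
    return ring_next(head, size) == tail;
}
```

With these definitions a ring of N entries holds at most N - 1 posted
descriptors, since one slot is given up to tell full apart from empty.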
| |
| CTRL register bits: |
| |
| bit name description |
| ------------------------------------------------------------------------ |
| [0] CTRL_RESET Reset the descriptor ring |
| [31:1] Reserved |
| |
| All descriptor types share some common fields: |
| |
| field width description |
| ------------------------------------------------------------------- |
| DMA_DESC_BUF_ADDR 8 Phys addr of desc payload, 8-byte |
| aligned |
| DMA_DESC_COOKIE 8 Desc cookie for completion matching, |
| upper-most bit is reserved |
| DMA_DESC_BUF_SIZE 2 Desc payload size in bytes |
| DMA_DESC_TLV_SIZE 2 Desc payload total size in bytes |
| used for TLVs. Must be <= |
| DMA_DESC_BUF_SIZE. |
| DMA_DESC_COMP_ERR 2 Completion status of associated |
| desc payload. High order bit is |
| clear on new descs, toggled by |
| hw for completed items. |
| |
| To support forward- and backward-compatibility, descriptor and completion |
| payloads are specified in TLV format. Fields are packed with Type=field name, |
| Length=field length, and Value=field value. Software will ignore unknown fields |
| filled in by the switch. Likewise, the switch will ignore unknown fields |
| filled in by software. |
| |
| Descriptor payload buffer is 8-byte aligned and TLVs are 8-byte aligned. The |
| value within a TLV is also 8-byte aligned. The (packed, 8 byte) TLV header is: |
| |
| field width description |
| ----------------------------- |
| type 4 TLV type |
| len 2 TLV value length |
| pad 2 Reserved |
| |
| The alignment requirements for descriptors and TLVs are to avoid unaligned |
| access exceptions in software. Note that the payload for each TLV is also |
| 8 byte aligned. |
| |
| Figure 1 shows an example descriptor buffer with two TLVs. |
| |
| <------- 8 bytes -------> |
| |
| 8-byte +––––+ +–––––––––––+–––––+–––––+ +–+ |
| align | type | len | pad | TLV#1 hdr | |
| +–––––––––––+–––––+–––––+ (len=22) | |
| | | | |
| | value | TLV#1 value | |
| | | (padded to 8-byte | |
| | +–––––+ alignment) | |
| | |/////| | |
| 8-byte +––––+ +–––––––––––+–––––––––––+ | |
| align | type | len | pad | TLV#2 hdr DESC_BUF_SIZE |
| +–––––+–––––+–––––+–––––+ (len=2) | |
| |value|/////////////////| TLV#2 value | |
| +–––––+/////////////////| | |
| |///////////////////////| | |
| |///////////////////////| | |
| |///////////////////////| | |
| |////////unused/////////| | |
| |////////space//////////| | |
| |///////////////////////| | |
| |///////////////////////| | |
| |///////////////////////| | |
| +–––––––––––––––––––––––+ +–+ |
| |
| fig. 1 |
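
The layout in Figure 1 can be reproduced with a small encoder. This is a
sketch under stated assumptions, not a reference implementation: tlv_put and
the other names are hypothetical, and the caller picks the TLV type values.
The header is written little-endian and the value is zero-padded to the next
8-byte boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TLV_ALIGN   8u
#define TLV_HDR_LEN 8u  /* type (4) + len (2) + pad (2) */

/* Round n up to the next multiple of 8. */
static inline size_t tlv_align(size_t n)
{
    return (n + TLV_ALIGN - 1) & ~(size_t)(TLV_ALIGN - 1);
}

/* Buffer space consumed by one TLV whose value is len bytes long. */
static inline size_t tlv_total_size(uint16_t len)
{
    return TLV_HDR_LEN + tlv_align(len);
}

/*
 * Append one TLV at offset off in buf (capacity cap).
 * Returns the offset just past the TLV, or 0 if it does not fit.
 */
static inline size_t tlv_put(uint8_t *buf, size_t cap, size_t off,
                             uint32_t type, const void *val, uint16_t len)
{
    size_t total = tlv_total_size(len);
    uint8_t *p;

    assert(off % TLV_ALIGN == 0);
    if (off > cap || total > cap - off)
        return 0;

    p = buf + off;
    memset(p, 0, total);              /* zero the pad field and value padding */
    p[0] = (uint8_t)(type & 0xff);    /* type, little-endian */
    p[1] = (uint8_t)((type >> 8) & 0xff);
    p[2] = (uint8_t)((type >> 16) & 0xff);
    p[3] = (uint8_t)(type >> 24);
    p[4] = (uint8_t)(len & 0xff);     /* len, little-endian */
    p[5] = (uint8_t)(len >> 8);
    if (len)
        memcpy(p + TLV_HDR_LEN, val, len);
    return off + total;
}
```

For the two TLVs in Figure 1 (len=22 and len=2), the first occupies 8 + 24 =
32 bytes and the second 8 + 8 = 16 bytes, so DMA_DESC_TLV_SIZE would be 48.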
| |
| TLVs can be nested within the NEST TLV type. |
| |
| Interrupt credits |
| ^^^^^^^^^^^^^^^^^ |
| |
| MSI-X vectors used for descriptor ring completions use a credit mechanism for |
| efficient device, PCIe bus, OS and driver operations. Each descriptor ring has |
| a credit count which represents the number of outstanding descriptors to be |
| processed by the driver. As the device marks descriptors complete, the credit |
| count is incremented. As the driver processes those outstanding descriptors, |
| it returns credits back to the device. This way, the device knows the driver's |
| progress and can make decisions about when to fire the next interrupt or not. |
| When the credit count is zero, and the first descriptors are posted for the |
| driver, a single interrupt is fired. Once the interrupt is fired, the |
| interrupt is disabled (auto-masked*). In response to the interrupt, the driver |
| will process descriptors and PIO write a returned credit value for that |
| descriptor ring. If the driver returns all credits (the driver caught up with |
| the device and there is no outstanding work), then the interrupt is unmasked, |
| but not fired. If only partial credits are returned, the interrupt remains |
| masked but the device generates an interrupt, signaling the driver that more |
| outstanding work is available. |
| |
| (* this masking is unrelated to the MSI-X interrupt mask register) |
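
The credit mechanism can be modeled as a small state machine. The sketch below
is one reading of the description above, written from the device's point of
view; the structure and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct ring_credits {
    uint32_t credits;  /* completed descriptors not yet returned by driver */
    bool masked;       /* auto-mask state (not the MSI-X mask bit) */
};

/*
 * Device marks n descriptors complete.
 * Returns true if an interrupt is fired.
 */
static inline bool credits_complete(struct ring_credits *r, uint32_t n)
{
    bool fire = n > 0 && r->credits == 0 && !r->masked;

    r->credits += n;
    if (fire)
        r->masked = true;  /* auto-mask after firing */
    return fire;
}

/*
 * Driver returns n credits with a PIO write.
 * Returns true if the device fires another interrupt.
 */
static inline bool credits_return(struct ring_credits *r, uint32_t n)
{
    assert(n <= r->credits);
    r->credits -= n;
    if (r->credits == 0) {
        r->masked = false;  /* all credits returned: unmask, do not fire */
        return false;
    }
    return true;            /* partial return: stay masked, fire again */
}
```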
| |
| Endianness |
| ---------- |
| |
| Device registers are hard-coded to little-endian (LE). The driver should |
| convert to/from host endianness to LE for device register accesses. |
| |
| Descriptors are LE. Descriptor buffer TLVs will have LE type and length |
| fields, but the value field can either be LE or network-byte-order, depending |
| on context. TLV values containing network packet data will be in network-byte |
| order. A TLV value containing a field or mask used to compare against network |
| packet data is network-byte order. For example, flow match fields (and masks) |
| are network-byte-order since they're matched directly, byte-by-byte, against |
| network packet data. All non-network-packet TLV multi-byte values will be LE. |
| |
| TLV values in network-byte-order are designated with (N). |
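
The two byte orders can be produced portably by storing bytes explicitly, as
in the following sketch. The helper names are hypothetical; put_vlan_id
illustrates an (N) value, here a 12-bit VLAN ID.

```c
#include <assert.h>
#include <stdint.h>

/* Store a 16-bit value in little-endian order (registers, TLV headers). */
static inline void put_le16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)(v >> 8);
}

/* Store a 32-bit value in little-endian order. */
static inline void put_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)((v >> 8) & 0xff);
    p[2] = (uint8_t)((v >> 16) & 0xff);
    p[3] = (uint8_t)(v >> 24);
}

/* Store a 16-bit value in network byte order (values marked (N)). */
static inline void put_be16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v >> 8);
    p[1] = (uint8_t)(v & 0xff);
}

/* A VLAN ID is matched against packet data, so it is stored as (N). */
static inline void put_vlan_id(uint8_t *p, uint16_t vid)
{
    assert(vid < 4096);
    put_be16(p, vid);
}
```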
| |
| |
| SECTION 5: Test Registers |
| ========================= |
| |
| Rocker has several test registers to support troubleshooting register access, |
| interrupt generation, and DMA operations: |
| |
| TEST_REG, offset 0x0010, 32-bit (R/W) |
| TEST_REG64, offset 0x0018, 64-bit (R/W) |
| TEST_IRQ, offset 0x0020, 32-bit (R/W) |
| TEST_DMA_ADDR, offset 0x0028, 64-bit (R/W) |
| TEST_DMA_SIZE, offset 0x0030, 32-bit (R/W) |
| TEST_DMA_CTRL, offset 0x0034, 32-bit (R/W) |
| |
| Reads to TEST_REG and TEST_REG64 will read a value equal to twice the last |
| value written to the register. The 32-bit and 64-bit versions are for testing |
| 32-bit and 64-bit host accesses. |
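
For emulation authors, the doubling behavior might be modeled as below. This
is a sketch with hypothetical names; wrapping on overflow is an assumption,
since the text above does not define the result when doubling exceeds the
register width.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Last values written to TEST_REG and TEST_REG64. */
struct test_regs {
    uint32_t test_reg;
    uint64_t test_reg64;
};

static inline void test_reg_write(struct test_regs *t, uint32_t v)
{
    assert(t != NULL);
    t->test_reg = v;
}

/* Reads return twice the last value written (assumed to wrap at 32 bits). */
static inline uint32_t test_reg_read(const struct test_regs *t)
{
    assert(t != NULL);
    return (uint32_t)(t->test_reg * 2u);
}

static inline void test_reg64_write(struct test_regs *t, uint64_t v)
{
    assert(t != NULL);
    t->test_reg64 = v;
}

/* Same rule for the 64-bit register (assumed to wrap at 64 bits). */
static inline uint64_t test_reg64_read(const struct test_regs *t)
{
    assert(t != NULL);
    return t->test_reg64 * 2u;
}

/* Driver-side check: write a value, expect to read back twice that value. */
static inline bool test_reg_check(struct test_regs *t, uint32_t v)
{
    test_reg_write(t, v);
    return test_reg_read(t) == (uint32_t)(v * 2u);
}
```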
| |
| A vector can be written to TEST_IRQ and the device will generate an interrupt |
| for that vector. |
| |
| To test basic DMA operations, allocate a DMA-able host buffer and put the |
| buffer address into TEST_DMA_ADDR and size into TEST_DMA_SIZE. Then, write to |
| TEST_DMA_CTRL to manipulate the buffer contents. TEST_DMA_CTRL operations are: |
| |
| operation value description |
| ----------------------------------------------------------- |
| TEST_DMA_CTRL_CLEAR 1 clear buffer |
| TEST_DMA_CTRL_FILL 2 fill buffer bytes with 0x96 |
| TEST_DMA_CTRL_INVERT 4 invert bytes in buffer |
| |
| Various buffer addresses and sizes should be tested to verify no address boundary |
| issue exists. In particular, buffers that start on odd-8-byte boundary and/or |
| span multiple PAGE sizes should be tested. |
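
A driver self-test needs to know what the buffer should contain after each
operation. The sketch below computes the expected contents on the host side.
It assumes "clear" means zero-fill and "invert" means a bitwise complement of
each byte; test_dma_apply is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    TEST_DMA_CTRL_CLEAR  = 1,
    TEST_DMA_CTRL_FILL   = 2,
    TEST_DMA_CTRL_INVERT = 4,
};

/* Apply one TEST_DMA_CTRL operation to a host-side copy of the buffer. */
static inline void test_dma_apply(uint8_t *buf, size_t size, uint32_t op)
{
    size_t i;

    assert(buf != NULL || size == 0);
    for (i = 0; i < size; i++) {
        switch (op) {
        case TEST_DMA_CTRL_CLEAR:
            buf[i] = 0;                 /* assumed: clear means zero */
            break;
        case TEST_DMA_CTRL_FILL:
            buf[i] = 0x96;
            break;
        case TEST_DMA_CTRL_INVERT:
            buf[i] = (uint8_t)~buf[i];  /* assumed: bitwise complement */
            break;
        default:
            break;                      /* unknown operation: no change */
        }
    }
}
```

After a fill followed by an invert, every byte should read 0x69, which gives a
quick way to confirm that both directions of DMA work.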
| |
| |
| SECTION 6: Ports |
| ================ |
| |
| Physical and Logical Ports |
| -------------------------- |
| |
| The switch supports up to 62 physical (front-panel) ports. Register |
| PORT_PHYS_COUNT returns the actual number of physical ports available: |
| |
| PORT_PHYS_COUNT, offset 0x0304, 32-bit, (R) |
| |
| In addition to front-panel ports, the switch supports logical ports for |
| tunnels. |
| |
| Front-panel ports and logical tunnel ports are mapped into a single 32-bit port |
| space. A special CPU port is assigned port 0. The front-panel ports are |
| mapped to ports 1-62. A special loopback port is assigned port 63. Logical |
| tunnel ports are assigned ports 0x00010000-0x0001ffff. |
| To summarize the port assignments: |
| |
| port mapping |
| ------------------------------------------------------- |
| 0 CPU port (for packets to/from host CPU) |
| 1-62 front-panel physical ports |
| 63 loopback port |
| 64-0x0000ffff RSVD |
| 0x00010000-0x0001ffff logical tunnel ports |
| 0x00020000-0xffffffff RSVD |
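
The port map above translates directly into a classification helper. The enum
and function names in this sketch are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PORT_TUNNEL_FIRST 0x00010000u
#define PORT_TUNNEL_LAST  0x0001ffffu

/* The logical tunnel range holds 64K ports. */
static_assert(PORT_TUNNEL_LAST - PORT_TUNNEL_FIRST + 1 == 0x10000u,
              "tunnel port range must span 64K ports");

enum port_kind {
    PORT_KIND_CPU,
    PORT_KIND_PHYS,
    PORT_KIND_LOOPBACK,
    PORT_KIND_TUNNEL,
    PORT_KIND_RSVD,
};

/* Classify a 32-bit port number according to the port map. */
static inline enum port_kind port_classify(uint32_t port)
{
    if (port == 0)
        return PORT_KIND_CPU;
    if (port <= 62)
        return PORT_KIND_PHYS;
    if (port == 63)
        return PORT_KIND_LOOPBACK;
    if (port >= PORT_TUNNEL_FIRST && port <= PORT_TUNNEL_LAST)
        return PORT_KIND_TUNNEL;
    return PORT_KIND_RSVD;
}
```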
| |
| Physical Port Mode |
| ------------------ |
| |
| Switch front-panel ports operate in a mode. Currently, the only mode is |
| OF-DPA. OF-DPA[1] mode is based on OpenFlow Data Plane Abstraction (OF-DPA) |
| Abstract Switch Specification, Version 1.0, from Broadcom Corporation. To |
| set/get the mode for front-panel ports, see port settings, below. |
| |
| Port Settings |
| ------------- |
| |
| Link status for all front-panel ports is available via PORT_PHYS_LINK_STATUS: |
| |
| PORT_PHYS_LINK_STATUS, offset 0x0310, 64-bit, (R) |
| |
| Value is port bitmap. Bits 0 and 63 always read 0. Bits 1-62 |
| read 1 for link UP and 0 for link DOWN for respective front-panel ports. |
| |
| Other properties for front-panel ports are available via DMA CMD descriptors: |
| |
| Get PORT_SETTINGS descriptor: |
| |
| field width description |
| ---------------------------------------------- |
| PORT_SETTINGS 2 CMD_GET |
| PPORT 4 Physical port # |
| |
| Get PORT_SETTINGS completion: |
| |
| field width description |
| ---------------------------------------------- |
| PPORT 4 Physical port # |
| SPEED 4 Current port interface speed, in Mbps |
| DUPLEX 1 1 = Full, 0 = Half |
| AUTONEG 1 1 = enabled, 0 = disabled |
| MACADDR 6 Port MAC address |
| MODE 1 0 = OF-DPA |
| LEARNING 1 MAC address learning on port |
| 1 = enabled |
| 0 = disabled |
| PHYS_NAME <var> Physical port name (string) |
| |
| Set PORT_SETTINGS descriptor: |
| |
| field width description |
| ---------------------------------------------- |
| PORT_SETTINGS 2 CMD_SET |
| PPORT 4 Physical port # |
| SPEED 4 Port interface speed, in Mbps |
| DUPLEX 1 1 = Full, 0 = Half |
| AUTONEG 1 1 = enabled, 0 = disabled |
| MACADDR 6 Port MAC address |
| MODE 1 0 = OF-DPA |
| |
| Port Enable |
| ----------- |
| |
| Front-panel ports are initially disabled, which means port ingress and egress |
| packets will be dropped. To enable or disable a port, use PORT_PHYS_ENABLE: |
| |
| PORT_PHYS_ENABLE: offset 0x0318, 64-bit, (R/W) |
| |
| Value is bitmap of first 64 ports. Bits 0 and 63 are ignored |
| and always read as 0. Write 1 to enable port; write 0 to disable it. |
| Default is 0. |
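
Both PORT_PHYS_LINK_STATUS and PORT_PHYS_ENABLE are port bitmaps, so the same
bit arithmetic applies to each. A minimal sketch with hypothetical helper
names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bitmap bit for a front-panel port (ports 1-62). */
static inline uint64_t pport_bit(uint32_t pport)
{
    assert(pport >= 1 && pport <= 62);
    return (uint64_t)1 << pport;
}

/* Test a port's bit in a value read from PORT_PHYS_LINK_STATUS. */
static inline bool pport_link_up(uint64_t link_status, uint32_t pport)
{
    return (link_status & pport_bit(pport)) != 0;
}

/*
 * Compute the new PORT_PHYS_ENABLE value from the current one,
 * enabling or disabling a single port and leaving the others unchanged.
 */
static inline uint64_t pport_enable(uint64_t cur, uint32_t pport, bool on)
{
    return on ? (cur | pport_bit(pport)) : (cur & ~pport_bit(pport));
}
```

Because PORT_PHYS_ENABLE is read/write, a driver would read the current value,
pass it through pport_enable(), and write the result back.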
| |
| |
| SECTION 7: Switch Control |
| ========================= |
| |
| This section covers switch-wide register settings. |
| |
| Control |
| ------- |
| |
| This register is used for low level control of the switch. |
| |
| CONTROL: offset 0x0300, 32-bit, (W) |
| |
| bit name description |
| ------------------------------------------------------------------------ |
| [0] CONTROL_RESET If set, device will perform reset |
| [31:1] Reserved |
| |
| Switch ID |
| --------- |
| |
| The switch has a SWITCH_ID to be used by software to uniquely identify the |
| switch: |
| |
| SWITCH_ID: offset 0x0320, 64-bit, (R) |
| |
| Value is opaque to switch software and no special encoding is implied. |
| |
| |
| SECTION 8: Events |
| ================= |
| |
| Non-I/O asynchronous events from the device are reported to the host using the |
| event ring. The TLV structure for events is: |
| |
| field width description |
| --------------------------------------------------- |
| TYPE 4 Event type, one of: |
| 1: LINK_CHANGED |
| 2: MAC_VLAN_SEEN |
| INFO <nest> Event info (details below) |
| |
| Link Changed Event |
| ------------------ |
| |
| When link status changes on a physical port, this event is generated. |
| |
| field width description |
| --------------------------------------------------- |
| INFO <nest> |
| PPORT 4 Physical port |
| LINKUP 1 Link status: |
| 0: down |
| 1: up |
| |
| MAC VLAN Seen Event |
| ------------------- |
| |
| When a packet ingresses on a port and the source MAC/VLAN isn't known to the |
| device, the device will generate this event. In response to the event, the |
| driver should install the MAC/VLAN for that port into the device's bridge |
| table. Once installed, the MAC/VLAN is known on the port and this event will |
| no longer be generated. |
| |
| field width description |
| --------------------------------------------------- |
| INFO <nest> |
| PPORT 4 Physical port |
| MAC 6 MAC address |
| VLAN 2 VLAN ID |
| |
| |
| SECTION 9: CPU Packet Processing |
| ================================ |
| |
| Ingress packets directed to the host CPU for further processing are delivered |
| in the DMA RX ring. Likewise, host CPU originating packets destined to egress |
| on switch ports are scheduled by software using the DMA TX ring. |
| |
| Tx Packet Processing |
| -------------------- |
| |
| Software schedules packets for egress on switch ports using the DMA TX ring. A |
| TX descriptor buffer describes the packet location and size in host DMA-able |
| memory, the destination port, and any hardware-offload functions (such as L3 |
| payload checksum offload). Software then bumps the descriptor head to signal |
| hardware of new Tx work. In response, hardware will DMA read Tx descriptors up |
| to head, DMA read descriptor buffer and packet data, perform offloading |
| functions, and finally frame packet on wire (network). Once packet processing |
| is complete, hardware will writeback status to descriptor(s) to signal to |
| software that Tx is complete and software resources (e.g. skb) backing packet |
| can be released. |
| |
| Figure 2 shows an example 3-fragment packet queued with one Tx descriptor. A |
| TLV is used for each packet fragment. |
| |
| pkt frag 1 |
| +–––––––+ +–+ |
| +–––+ | | |
| desc buf | | | | |
| +––––––––+ | | | | |
| Tx ring +–––+ +–––––+ | | | |
| +–––––––––+ | | TLVs | +–––––––+ | |
| | +–––+ +––––––––+ pkt frag 2 | |
| | desc 0 | | +–––––+ +–––––––+ | |
| +–––––––––+ | TLVs | +–––+ | | |
| head+–+ | +––––––––+ | | | |
| | desc 1 | | +–––––+ +–––––––+ |pkt |
| +–––––––––+ | TLVs | | | |
| | | +––––––––+ | pkt frag 3 | |
| | | | +–––––––+ | |
| +–––––––––+ +–––+ | | |
| | | | | | |
| | | | | | |
| +–––––––––+ | | | |
| | | | | | |
| | | | | | |
| +–––––––––+ | | | |
| | | +–––––––+ +–+ |
| | | |
| +–––––––––+ |
| |
| fig 2. |
| |
| The TLVs for Tx descriptor buffer are: |
| |
| field width description |
| --------------------------------------------------------------------- |
| PPORT 4 Destination physical port # |
| TX_OFFLOAD 1 Hardware offload modes: |
| 0: no offload |
| 1: insert IP csum (ipv4 only) |
| 2: insert TCP/UDP csum |
| 3: L3 csum calc and insert |
| into csum offset (TX_L3_CSUM_OFF) |
| 16-bit 1's complement csum value. |
| IPv4 pseudo-header and IP |
| already calculated by OS |
| and inserted. |
| 4: TSO (TCP Segmentation Offload) |
| TX_L3_CSUM_OFF 2 For L3 csum offload mode, the offset, |
| from the beginning of the packet, |
| of the csum field in the L3 header |
| TX_TSO_MSS 2 For TSO offload mode, the |
| Maximum Segment Size in bytes |
| TX_TSO_HDR_LEN 2 For TSO offload mode, the |
| length of ethernet, IP, and |
| TCP/UDP headers, including IP |
| and TCP options. |
| TX_FRAGS <array> Packet fragments |
| TX_FRAG <nest> Packet fragment |
| TX_FRAG_ADDR 8 DMA address of packet fragment |
| TX_FRAG_LEN 2 Packet fragment length |
| |
| Possible status return codes in descriptor on completion are: |
| |
| DESC_COMP_ERR reason |
| -------------------------------------------------------------------- |
| 0 OK |
| -ROCKER_ENXIO address or data read err on desc buf or packet |
| fragment |
| -ROCKER_EINVAL bad pport or TSO or csum offloading error |
| -ROCKER_ENOMEM no memory for internal staging tx fragment |
| |
| Rx Packet Processing |
| -------------------- |
| |
| Packets ingressing on switch ports that are not forwarded by the switch but |
| rather directed to the host CPU for further processing are delivered in the DMA |
| RX ring. Rx descriptor buffers are allocated by software and placed on the |
| ring. Hardware will fill Rx descriptor buffers with packet data, write the |
| completion, and signal to software that a new packet is ready. Since Rx packet |
| size is not known a priori, the Rx descriptor buffer must be allocated for |
| worst-case packet size. A single Rx descriptor will contain the entire Rx |
| packet data in one RX_FRAG. Other Rx TLVs describe any hardware offloads |
| performed on the packet, such as checksum validation. |
| |
| The TLVs for Rx descriptor buffer are: |
| |
| field width description |
| --------------------------------------------------- |
| PPORT 4 Source physical port # |
| RX_FLAGS 2 Packet parsing flags: |
| (1 << 0): IPv4 packet |
| (1 << 1): IPv6 packet |
| (1 << 2): csum calculated |
| (1 << 3): IPv4 csum good |
| (1 << 4): IP fragment |
| (1 << 5): TCP packet |
| (1 << 6): UDP packet |
| (1 << 7): TCP/UDP csum good |
| (1 << 8): Offload forward |
| RX_CSUM 2 IP calculated checksum: |
| IPv4: IP payload csum |
| IPv6: header and payload csum |
| (Only valid if RX_FLAGS:csum calc is set) |
| RX_FRAG_ADDR 8 DMA address of packet fragment |
| RX_FRAG_MAX_LEN 2 Packet maximum fragment length |
| RX_FRAG_LEN 2 Actual packet fragment length after receive |
| |
| Offload forward RX_FLAG indicates the device has already forwarded the packet |
| so the host CPU should not also forward the packet. |
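
The RX_FLAGS bits can be decoded as in the sketch below. The constant and
function names are hypothetical; the bit positions come from the table above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum {
    RX_FLAGS_IPV4             = 1 << 0,
    RX_FLAGS_IPV6             = 1 << 1,
    RX_FLAGS_CSUM_CALC        = 1 << 2,
    RX_FLAGS_IPV4_CSUM_GOOD   = 1 << 3,
    RX_FLAGS_IP_FRAG          = 1 << 4,
    RX_FLAGS_TCP              = 1 << 5,
    RX_FLAGS_UDP              = 1 << 6,
    RX_FLAGS_TCP_UDP_CSUM_GOOD = 1 << 7,
    RX_FLAGS_FWD_OFFLOAD      = 1 << 8,
};

static_assert(RX_FLAGS_FWD_OFFLOAD <= 0xffff, "RX_FLAGS is a 2-byte field");

/* True if the packet is TCP or UDP and hardware found its checksum good. */
static inline bool rx_l4_csum_good(uint16_t flags)
{
    return (flags & (RX_FLAGS_TCP | RX_FLAGS_UDP)) != 0 &&
           (flags & RX_FLAGS_TCP_UDP_CSUM_GOOD) != 0;
}

/* RX_CSUM is only meaningful when the csum-calculated flag is set. */
static inline bool rx_csum_valid(uint16_t flags)
{
    return (flags & RX_FLAGS_CSUM_CALC) != 0;
}

/* False if the device already forwarded the packet. */
static inline bool rx_host_should_forward(uint16_t flags)
{
    return (flags & RX_FLAGS_FWD_OFFLOAD) == 0;
}
```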
| |
| Possible status return codes in descriptor on completion are: |
| |
| DESC_COMP_ERR reason |
| -------------------------------------------------------------------- |
| 0 OK |
| -ROCKER_ENXIO address or data read err on desc buf |
| -ROCKER_ENOMEM no memory for internal staging desc buf |
| -ROCKER_EMSGSIZE Rx descriptor buffer wasn't big enough to contain |
| packet data TLV and other TLVs. |
| |
| |
| SECTION 10: OF-DPA Mode |
| ======================= |
| |
| OF-DPA mode allows the switch to offload flow packet processing functions to |
| hardware. An OpenFlow controller would communicate with an OpenFlow agent |
| installed on the switch. The OpenFlow agent would (directly or indirectly) |
| communicate with the Rocker switch driver, which in turn would program switch |
| hardware with flow functionality, as defined in OF-DPA. The block diagram is: |
| |
| +–––––––––––––––----–––+ |
| | OF | |
| | Remote Controller | |
| +––––––––+––----–––––––+ |
| | |
| | |
| +––––––––+–––––––––+ |
| | OF | |
| | Local Agent | |
| +––––––––––––––––––+ |
| | | |
| | Rocker Driver | |
| +––––––––––––––––––+ |
| <this spec> |
| +––––––––––––––––––+ |
| | | |
| | Rocker Switch | |
| +––––––––––––––––––+ |
| |
| To participate in flow functions, ports must be configured for OF-DPA mode |
| during switch initialization. |
| |
| OF-DPA Flow Table Interface |
| --------------------------- |
| |
| There are commands to add, modify, delete, and get stats of flow table entries. |
| The commands are issued using the DMA CMD descriptor ring. The following |
| commands are defined: |
| |
| CMD_ADD: add an entry to flow table |
| CMD_MOD: modify an entry in flow table |
| CMD_DEL: delete an entry from flow table |
| CMD_GET_STATS: get stats for flow entry |
| |
| TLVs for add and modify commands are: |
| |
| field width description |
| ---------------------------------------------------- |
| OF_DPA_CMD 2 CMD_[ADD|MOD] |
| OF_DPA_TBL 2 Flow table ID |
| 0: ingress port |
| 10: vlan |
| 20: termination mac |
| 30: unicast routing |
| 40: multicast routing |
| 50: bridging |
| 60: ACL policy |
| OF_DPA_PRIORITY 4 Flow priority |
| OF_DPA_HARDTIME 4 Hard timeout for flow |
| OF_DPA_IDLETIME 4 Idle timeout for flow |
| OF_DPA_COOKIE 8 Cookie |
| |
| Additional TLVs based on flow table ID: |
| |
| Table ID 0: ingress port |
| |
| field width description |
| ---------------------------------------------------- |
| OF_DPA_IN_PPORT 4 ingress physical port number |
| OF_DPA_GOTO_TBL 2 goto table ID; zero to drop |
| |
| Table ID 10: vlan |
| |
| field width description |
| ---------------------------------------------------- |
| OF_DPA_IN_PPORT 4 ingress physical port number |
| OF_DPA_VLAN_ID 2 (N) vlan ID |
| OF_DPA_VLAN_ID_MASK 2 (N) vlan ID mask |
| OF_DPA_GOTO_TBL 2 goto table ID; zero to drop |
| OF_DPA_NEW_VLAN_ID 2 (N) new vlan ID |
| |
| Table ID 20: termination mac |
| |
| field width description |
| ---------------------------------------------------- |
| OF_DPA_IN_PPORT 4 ingress physical port number |
| OF_DPA_IN_PPORT_MASK 4 ingress physical port number mask |
| OF_DPA_ETHERTYPE 2 (N) must be either 0x0800 or 0x86dd |
| OF_DPA_DST_MAC 6 (N) destination MAC |
| OF_DPA_DST_MAC_MASK 6 (N) destination MAC mask |
| OF_DPA_VLAN_ID 2 (N) vlan ID |
| OF_DPA_VLAN_ID_MASK 2 (N) vlan ID mask |
| OF_DPA_GOTO_TBL 2 only acceptable values are |
| unicast or multicast routing |
| table IDs |
| OF_DPA_OUT_PPORT 2 if specified, must be |
| controller, set zero otherwise |
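
A driver may want to validate OF_DPA_GOTO_TBL before issuing a termination MAC
entry. The sketch below uses the table IDs listed for OF_DPA_TBL; the enum and
function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum of_dpa_tbl_id {
    OF_DPA_TBL_INGRESS_PORT      = 0,
    OF_DPA_TBL_VLAN              = 10,
    OF_DPA_TBL_TERMINATION_MAC   = 20,
    OF_DPA_TBL_UNICAST_ROUTING   = 30,
    OF_DPA_TBL_MULTICAST_ROUTING = 40,
    OF_DPA_TBL_BRIDGING          = 50,
    OF_DPA_TBL_ACL_POLICY        = 60,
};

static_assert(OF_DPA_TBL_ACL_POLICY <= 0xffff, "table ID is a 2-byte field");

/* Termination MAC entries may only go to a routing table. */
static inline bool term_mac_goto_valid(uint16_t goto_tbl)
{
    return goto_tbl == OF_DPA_TBL_UNICAST_ROUTING ||
           goto_tbl == OF_DPA_TBL_MULTICAST_ROUTING;
}
```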
| |
| Table ID 30: unicast routing |
| |
| field width description |
| ---------------------------------------------------- |
| OF_DPA_ETHERTYPE 2 (N) must be either 0x0800 or 0x86dd |
| OF_DPA_DST_IP 4 (N) destination IPv4 address. |
| Must be unicast address |
| OF_DPA_DST_IP_MASK 4 (N) IP mask. Must be prefix mask |
| OF_DPA_DST_IPV6 16 (N) destination IPv6 address. |
| Must be unicast address |
| OF_DPA_DST_IPV6_MASK 16 (N) IPv6 mask. Must be prefix mask |
| OF_DPA_GOTO_TBL 2 goto table ID; zero to drop |
| OF_DPA_GROUP_ID 4 data for GROUP action must |
| be an L3 Unicast group entry |
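
The "must be prefix mask" requirement can be checked before an entry is
submitted. In this sketch the mask is held in host byte order and would still
need converting to network byte order when written into the TLV; the function
names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Build an IPv4 prefix mask (host byte order) from a prefix length 0-32. */
static inline uint32_t ipv4_prefix_mask(unsigned len)
{
    assert(len <= 32);
    return len == 0 ? 0 : (uint32_t)(0xffffffffu << (32 - len));
}

/*
 * A prefix mask is a run of 1 bits followed only by 0 bits, so its
 * complement has the form 2^k - 1 and shares no bits with (2^k - 1) + 1.
 */
static inline bool ipv4_is_prefix_mask(uint32_t mask)
{
    uint32_t inv = ~mask;

    return (inv & (inv + 1)) == 0;
}
```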
| |
| Table ID 40: multicast routing |
| |
| field width description |
| ---------------------------------------------------- |
| OF_DPA_ETHERTYPE 2 (N) must be either 0x0800 or 0x86dd |
| OF_DPA_VLAN_ID 2 (N) vlan ID |
| OF_DPA_SRC_IP 4 (N) source IPv4. Optional, |
| can contain IPv4 address, |
| must be completely masked |
| if not used |
| OF_DPA_SRC_IP_MASK 4 (N) IP Mask |
| OF_DPA_DST_IP 4 (N) destination IPv4 address. |
| Must be multicast address |
| OF_DPA_SRC_IPV6 16 (N) source IPv6 Address. Optional. |
| Can contain IPv6 address, |
| must be completely masked |
| if not used |
| OF_DPA_SRC_IPV6_MASK 16 (N) IPv6 mask. |
| OF_DPA_DST_IPV6 16 (N) destination IPv6 Address. Must |
| be multicast address |
| OF_DPA_GOTO_TBL 2 goto table ID; zero to drop |
| OF_DPA_GROUP_ID 4 data for GROUP action must |
| be an L3 multicast group entry |
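
The multicast address requirements can be checked with two small predicates.
This sketch uses the standard multicast ranges (224.0.0.0/4 for IPv4, ff00::/8
for IPv6). The IPv4 address is taken in host byte order and the IPv6 address
as 16 bytes in network order; the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* IPv4 multicast: top four address bits are 1110 (224.0.0.0/4). */
static inline bool ipv4_is_multicast(uint32_t addr)
{
    return (addr >> 28) == 0xeu;
}

/* IPv6 multicast: first address byte is 0xff (ff00::/8). */
static inline bool ipv6_is_multicast(const uint8_t addr[16])
{
    assert(addr != NULL);
    return addr[0] == 0xff;
}
```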
| |
| Table ID 50: bridging |
| |
| field width description |
| ---------------------------------------------------- |
| OF_DPA_VLAN_ID 2 (N) vlan ID |
| OF_DPA_TUNNEL_ID 4 tunnel ID |
| OF_DPA_DST_MAC 6 (N) destination MAC |
| OF_DPA_DST_MAC_MASK 6 (N) destination MAC mask |
| OF_DPA_GOTO_TBL 2 goto table ID; zero to drop |
| OF_DPA_GROUP_ID 4 data for GROUP action must |
| be a L2 Interface, L2 |
| Multicast, L2 Flood, |
| or L2 Overlay group entry |
| as appropriate |
| OF_DPA_TUNNEL_LPORT 4 unicast Tenant Bridging |
| flows specify a tunnel |
| logical port ID |
| OF_DPA_OUT_PPORT 2 data for OUTPUT action, |
| restricted to CONTROLLER, |
| set to 0 otherwise |
| |
| Table ID 60: acl policy |
| |
| field width description |
| ---------------------------------------------------- |
| OF_DPA_IN_PPORT 4 ingress physical port number |
| OF_DPA_IN_PPORT_MASK 4 ingress physical port number mask |
| OF_DPA_ETHERTYPE 2 (N) ethertype |
| OF_DPA_VLAN_ID 2 (N) vlan ID |
| OF_DPA_VLAN_ID_MASK 2 (N) vlan ID mask |
| OF_DPA_VLAN_PCP 2 (N) vlan Priority Code Point |
| OF_DPA_VLAN_PCP_MASK 2 (N) vlan Priority Code Point mask |
| OF_DPA_SRC_MAC 6 (N) source MAC |
| OF_DPA_SRC_MAC_MASK 6 (N) source MAC mask |
| OF_DPA_DST_MAC 6 (N) destination MAC |
| OF_DPA_DST_MAC_MASK 6 (N) destination MAC mask |
| OF_DPA_TUNNEL_ID 4 tunnel ID |
| OF_DPA_SRC_IP 4 (N) source IPv4. Optional, |
| can contain IPv4 address, |
| must be completely masked |
| if not used |
| OF_DPA_SRC_IP_MASK 4 (N) IP Mask |
| OF_DPA_DST_IP 4 (N) destination IPv4 address. |
| Must be multicast address |
| OF_DPA_DST_IP_MASK 4 (N) IP Mask |
| OF_DPA_SRC_IPV6 16 (N) source IPv6 Address. Optional. |
| Can contain IPv6 address, |
| must be completely masked |
| if not used |
| OF_DPA_SRC_IPV6_MASK 16 (N) IPv6 mask |
| OF_DPA_DST_IPV6 16 (N) destination IPv6 Address. Must |
| be multicast address. |
| OF_DPA_DST_IPV6_MASK 16 (N) IPv6 mask |
| OF_DPA_SRC_ARP_IP 4 (N) source IPv4 address in the ARP |
| payload. Only used if ethertype |
| == 0x0806. |
| OF_DPA_SRC_ARP_IP_MASK 4 (N) IP Mask |
| OF_DPA_IP_PROTO 1 IP protocol |
| OF_DPA_IP_PROTO_MASK 1 IP protocol mask |
| OF_DPA_IP_DSCP 1 DSCP |
| OF_DPA_IP_DSCP_MASK 1 DSCP mask |
| OF_DPA_IP_ECN 1 ECN |
| OF_DPA_IP_ECN_MASK 1 ECN mask |
| OF_DPA_L4_SRC_PORT 2 (N) L4 source port, only for |
| TCP, UDP, or SCTP |
| OF_DPA_L4_SRC_PORT_MASK 2 (N) L4 source port mask |
| OF_DPA_L4_DST_PORT 2 (N) L4 destination port, only for |
| TCP, UDP, or SCTP |
| OF_DPA_L4_DST_PORT_MASK 2 (N) L4 destination port mask |
| OF_DPA_ICMP_TYPE 1 ICMP type, only if IP |
| protocol is 1 |
| OF_DPA_ICMP_TYPE_MASK 1 ICMP type mask |
| OF_DPA_ICMP_CODE 1 ICMP code |
| OF_DPA_ICMP_CODE_MASK 1 ICMP code mask |
| OF_DPA_IPV6_LABEL 4 (N) IPv6 flow label |
| OF_DPA_IPV6_LABEL_MASK 4 (N) IPv6 flow label mask |
| OF_DPA_GROUP_ID 4 data for GROUP action |
| OF_DPA_QUEUE_ID_ACTION 1 write the queue ID |
| OF_DPA_NEW_QUEUE_ID 1 queue ID |
| OF_DPA_VLAN_PCP_ACTION 1 write the VLAN priority |
| OF_DPA_NEW_VLAN_PCP 1 VLAN priority |
| OF_DPA_IP_DSCP_ACTION 1 write the DSCP |
| OF_DPA_NEW_IP_DSCP 1 new DSCP |
| OF_DPA_TUNNEL_LPORT 4 restrict to valid tunnel |
| logical port, set to 0 |
| otherwise. |
| OF_DPA_OUT_PPORT 2 data for OUTPUT action, |
| restricted to CONTROLLER, |
| set to 0 otherwise |
| OF_DPA_CLEAR_ACTIONS 4 if 1, packets matching the flow |
| are dropped (all other |
| instructions ignored) |
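
The (N) marker and the multicast restriction on OF_DPA_DST_IP and OF_DPA_DST_IPV6 can be illustrated with a minimal sketch. The helper names pack_multicast_dst and pack_port are hypothetical and are not part of the Rocker interface; the sketch produces only the value bytes of a TLV, not the TLV framing.

```python
import ipaddress


def pack_multicast_dst(addr: str) -> bytes:
    """Return the TLV value bytes for a destination IP field.

    Hypothetical helper: rejects non-multicast addresses, since the
    table above requires the destination to be a multicast address.
    """
    ip = ipaddress.ip_address(addr)
    if not ip.is_multicast:
        raise ValueError(f"{addr} is not a multicast address")
    # .packed is big-endian (network byte order):
    # 4 bytes for IPv4, 16 bytes for IPv6.
    return ip.packed


def pack_port(port: int) -> bytes:
    """Return a 2-byte L4 port value in network byte order."""
    return port.to_bytes(2, "big")
```

The returned lengths match the widths in the table: 4 bytes for OF_DPA_DST_IP, 16 bytes for OF_DPA_DST_IPV6, and 2 bytes for the L4 port fields.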
| |
| TLVs for flow delete and get stats commands are: |
| |
| field width description |
| --------------------------------------------------- |
| OF_DPA_CMD 2 CMD_[DEL|GET_STATS] |
| OF_DPA_COOKIE 8 Cookie |
| |
| On completion of get stats command, the descriptor buffer is written back with |
| the following TLVs: |
| |
| field width description |
| --------------------------------------------------- |
| OF_DPA_STAT_DURATION 4 Flow duration |
| OF_DPA_STAT_RX_PKTS 8 Received packets |
| OF_DPA_STAT_TX_PKTS 8 Transmit packets |
| |
| Possible status return codes in descriptor on completion are: |
| |
| DESC_COMP_ERR command reason |
| -------------------------------------------------------------------- |
| 0 all OK |
| -ROCKER_EFAULT all head or tail index outside |
| of ring |
| -ROCKER_ENXIO all address or data read err on |
| desc buf |
| -ROCKER_EMSGSIZE GET_STATS cmd descriptor buffer wasn't |
| big enough to contain write-back |
| TLVs |
| -ROCKER_EINVAL all invalid parameters passed in |
| -ROCKER_EEXIST ADD entry already exists |
| -ROCKER_ENOSPC ADD no space left in flow table |
| -ROCKER_ENOENT MOD|DEL|GET_STATS cookie invalid |
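
As an illustration, the status table above can be held as data so a driver or emulator test can ask which codes a given flow command may return. This is a sketch only: the names FLOW_STATUS and possible_errors are hypothetical, and the codes are kept as symbolic strings because their numeric values are not given in this section.

```python
ALL_CMDS = {"ADD", "MOD", "DEL", "GET_STATS"}

# (status code, commands that may return it, reason), in table order.
FLOW_STATUS = [
    ("-ROCKER_EFAULT", ALL_CMDS, "head or tail index outside of ring"),
    ("-ROCKER_ENXIO", ALL_CMDS, "address or data read err on desc buf"),
    ("-ROCKER_EMSGSIZE", {"GET_STATS"},
     "cmd descriptor buffer wasn't big enough to contain write-back TLVs"),
    ("-ROCKER_EINVAL", ALL_CMDS, "invalid parameters passed in"),
    ("-ROCKER_EEXIST", {"ADD"}, "entry already exists"),
    ("-ROCKER_ENOSPC", {"ADD"}, "no space left in flow table"),
    ("-ROCKER_ENOENT", {"MOD", "DEL", "GET_STATS"}, "cookie invalid"),
]


def possible_errors(cmd: str) -> list:
    """List the non-zero status codes a flow table command may return."""
    if cmd not in ALL_CMDS:
        raise ValueError(f"unknown command {cmd!r}")
    return [name for name, cmds, _ in FLOW_STATUS if cmd in cmds]
```

For example, possible_errors("DEL") excludes -ROCKER_EEXIST and -ROCKER_ENOSPC, which the table lists for ADD only.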
| |
| Group Table Interface |
| --------------------- |
| |
| There are commands to add, modify, delete, and get stats of group table |
| entries. The commands are issued using the DMA CMD descriptor ring. The |
| following commands are defined: |
| |
| CMD_ADD: add an entry to group table |
| CMD_MOD: modify an entry in group table |
| CMD_DEL: delete an entry from group table |
| CMD_GET_STATS: get stats for group entry |
| |
| TLVs for add and modify commands are: |
| |
| field width description |
| ----------------------------------------------------------- |
| FLOW_GROUP_CMD 2 CMD_[ADD|MOD] |
| FLOW_GROUP_ID 2 Flow group ID |
| FLOW_GROUP_TYPE 1 Group type: |
| 0: L2 interface |
| 1: L2 rewrite |
| 2: L3 unicast |
| 3: L2 multicast |
| 4: L2 flood |
| 5: L3 interface |
| 6: L3 multicast |
| 7: L3 ECMP |
| 8: L2 overlay |
| FLOW_VLAN_ID 2 VLAN ID (types 0, 3, 4, 6) |
| FLOW_L2_PORT 2 Port (type 0) |
| FLOW_INDEX 4 Index (all types but 0) |
| FLOW_OVERLAY_TYPE 1 Overlay sub-type (type 8): |
| 0: Flood unicast tunnel |
| 1: Flood multicast tunnel |
| 2: Multicast unicast tunnel |
| 3: Multicast multicast tunnel |
| FLOW_GROUP_ACTION nest |
| FLOW_GROUP_ID 2 next group ID in chain (all |
| types except 0) |
| FLOW_OUT_PORT 4 egress port (types 0, 8) |
| FLOW_POP_VLAN_TAG 1 strip outer VLAN tag (type 1 |
| only) |
| FLOW_VLAN_ID 2 (types 1, 5) |
| FLOW_SRC_MAC 6 (types 1, 2, 5) |
| FLOW_DST_MAC 6 (types 1, 2) |
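
The per-type annotations in the table above can be read as a lookup from group type to applicable TLVs. The following sketch encodes them that way; applicable_fields and the two dictionaries are hypothetical names used for illustration, and the sketch assumes that fields without a type annotation apply to every group type.

```python
ALL_TYPES = set(range(9))  # group types 0..8 from the table

# Top-level TLVs that apply only to some group types.
TOP_LEVEL = {
    "FLOW_VLAN_ID": {0, 3, 4, 6},
    "FLOW_L2_PORT": {0},
    "FLOW_INDEX": ALL_TYPES - {0},
    "FLOW_OVERLAY_TYPE": {8},
}

# TLVs nested inside FLOW_GROUP_ACTION.
ACTION = {
    "FLOW_GROUP_ID": ALL_TYPES - {0},
    "FLOW_OUT_PORT": {0, 8},
    "FLOW_POP_VLAN_TAG": {1},
    "FLOW_VLAN_ID": {1, 5},
    "FLOW_SRC_MAC": {1, 2, 5},
    "FLOW_DST_MAC": {1, 2},
}


def applicable_fields(group_type: int):
    """Return (top-level TLVs, nested action TLVs) for a group type."""
    if group_type not in ALL_TYPES:
        raise ValueError(f"unknown group type {group_type}")
    top = ["FLOW_GROUP_CMD", "FLOW_GROUP_ID", "FLOW_GROUP_TYPE"]
    top += [name for name, types in TOP_LEVEL.items() if group_type in types]
    nested = [name for name, types in ACTION.items() if group_type in types]
    return top, nested
```

For an L2 interface group (type 0) this yields the VLAN ID and port at the top level and only FLOW_OUT_PORT in the nested action, since type 0 has no next group in the chain.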
| |
| TLVs for group delete and get stats commands are: |
| |
| field width description |
| ----------------------------------------------------------- |
| FLOW_GROUP_CMD 2 CMD_[DEL|GET_STATS] |
| FLOW_GROUP_ID 2 Flow group ID |
| |
| On completion of get stats command, the descriptor buffer is written back with |
| the following TLVs: |
| |
| field width description |
| --------------------------------------------------- |
| FLOW_GROUP_ID 2 Flow group ID |
| FLOW_STAT_DURATION 4 Flow duration |
| FLOW_STAT_REF_COUNT 4 Flow reference count |
| FLOW_STAT_BUCKET_COUNT 4 Flow bucket count |
| |
| Possible status return codes in descriptor on completion are: |
| |
| DESC_COMP_ERR command reason |
| -------------------------------------------------------------------- |
| 0 all OK |
| -ROCKER_EFAULT all head or tail index outside |
| of ring |
| -ROCKER_ENXIO all address or data read err on |
| desc buf |
| -ROCKER_ENOSPC GET_STATS cmd descriptor buffer wasn't |
| big enough to contain write-back |
| TLVs |
| -ROCKER_EINVAL ADD|MOD invalid parameters passed in |
| -ROCKER_EEXIST ADD entry already exists |
| -ROCKER_ENOSPC ADD no space left in group table |
| -ROCKER_ENOENT MOD|DEL|GET_STATS group ID invalid |
| -ROCKER_EBUSY DEL group reference count non-zero |
| -ROCKER_ENODEV ADD next group ID doesn't exist |
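
The -ROCKER_ENODEV and -ROCKER_EBUSY codes imply an ordering: a chained group can only be added after its next group exists, and a group can only be deleted once nothing refers to it. The toy model below sketches that behaviour for emulator authors. It rests on assumptions not stated in this section: only group-to-group chaining is counted as a reference, the order of the checks is arbitrary, and the class and method names are hypothetical.

```python
class GroupTable:
    """Toy in-memory model of group add/delete status codes."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.next_of = {}  # group ID -> next group ID in chain, or None
        self.refs = {}     # group ID -> number of groups chaining to it

    def add(self, group_id: int, next_id=None):
        if group_id in self.next_of:
            return "-ROCKER_EEXIST"
        if len(self.next_of) >= self.capacity:
            return "-ROCKER_ENOSPC"
        if next_id is not None and next_id not in self.next_of:
            return "-ROCKER_ENODEV"
        self.next_of[group_id] = next_id
        self.refs[group_id] = 0
        if next_id is not None:
            self.refs[next_id] += 1
        return 0

    def delete(self, group_id: int):
        if group_id not in self.next_of:
            return "-ROCKER_ENOENT"
        if self.refs[group_id] != 0:
            return "-ROCKER_EBUSY"
        next_id = self.next_of.pop(group_id)
        del self.refs[group_id]
        if next_id is not None:
            self.refs[next_id] -= 1  # release the chained group
        return 0
```

In this model, groups are added leaf first and deleted in the reverse order; deleting the leaf while another group still chains to it returns -ROCKER_EBUSY.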
| |
| |
| |
| References |
| ========== |
| |
| [1] OpenFlow Data Plane Abstraction (OF-DPA) Abstract Switch Specification, |
| Version 1.0, from Broadcom Corporation, February 21, 2014. |