<?xml version="1.0" encoding="utf-8"?> <!DOCTYPE rfc SYSTEM 'rfc2629.dtd'> <?rfc toc="yes" symrefs="yes" ?> <rfc ipr="trust200902" category="std" docName="draft-ietf-codec-opus-14"> <front> <title abbrev="Interactive Audio Codec">Definition of the Opus Audio Codec</title> <author initials="JM" surname="Valin" fullname="Jean-Marc Valin"> <organization>Mozilla Corporation</organization> <address> <postal> <street>650 Castro Street</street> <city>Mountain View</city> <region>CA</region> <code>94041</code> <country>USA</country> </postal> <phone>+1 650 903-0800</phone> <email>jmvalin@jmvalin.ca</email> </address> </author> <author initials="K." surname="Vos" fullname="Koen Vos"> <organization>Skype Technologies S.A.</organization> <address> <postal> <street>Soder Malarstrand 43</street> <city>Stockholm</city> <region></region> <code>11825</code> <country>SE</country> </postal> <phone>+46 73 085 7619</phone> <email>koen.vos@skype.net</email> </address> </author> <author initials="T." surname="Terriberry" fullname="Timothy B. Terriberry"> <organization>Mozilla Corporation</organization> <address> <postal> <street>650 Castro Street</street> <city>Mountain View</city> <region>CA</region> <code>94041</code> <country>USA</country> </postal> <phone>+1 650 903-0800</phone> <email>tterriberry@mozilla.com</email> </address> </author> <date day="17" month="May" year="2012" /> <area>General</area> <workgroup></workgroup> <abstract> <t> This document defines the Opus interactive speech and audio codec. Opus is designed to handle a wide range of interactive audio applications, including Voice over IP, videoconferencing, in-game chat, and even live, distributed music performances. It scales from low bitrate narrowband speech at 6 kb/s to very high quality stereo music at 510 kb/s. Opus uses both linear prediction (LP) and the Modified Discrete Cosine Transform (MDCT) to achieve good compression of both speech and music. 
</t> </abstract> </front> <middle> <section anchor="introduction" title="Introduction"> <t> The Opus codec is a real-time interactive audio codec designed to meet the requirements described in <xref target="requirements"></xref>. It is composed of a linear prediction (LP)-based <xref target="LPC"/> layer and a Modified Discrete Cosine Transform (MDCT)-based <xref target="MDCT"/> layer. The main idea behind using two layers is that in speech, linear prediction techniques (such as Code-Excited Linear Prediction, or CELP) code low frequencies more efficiently than transform (e.g., MDCT) domain techniques, while the situation is reversed for music and higher speech frequencies. Thus a codec with both layers available can operate over a wider range than either one alone and, by combining them, achieve better quality than either one individually. </t> <t> The primary normative part of this specification is provided by the source code in <xref target="ref-implementation"></xref>. Only the decoder portion of this software is normative, though a significant amount of code is shared by both the encoder and decoder. <xref target="conformance"/> provides a decoder conformance test. The decoder contains a great deal of integer and fixed-point arithmetic which needs to be performed exactly, including all rounding considerations, so any useful specification requires domain-specific symbolic language to adequately define these operations. Additionally, any conflict between the symbolic representation and the included reference implementation must be resolved. For the practical reasons of compatibility and testability it would be advantageous to give the reference implementation priority in any disagreement. The C language is also one of the most widely understood human-readable symbolic representations for machine behavior. For these reasons this RFC uses the reference implementation as the sole symbolic representation of the codec. 
</t> <t>While the symbolic representation is unambiguous and complete, it is not always the easiest way to understand the codec's operation. For this reason, this document also describes significant parts of the codec in English and takes the opportunity to explain the rationale behind many of the more surprising elements of the design. These descriptions are intended to be accurate and informative, but the limitations of common English sometimes result in ambiguity, so it is expected that the reader will always read them alongside the symbolic representation. Numerous references to the implementation are provided for this purpose. The descriptions sometimes differ from the reference in ordering or through mathematical simplification wherever such deviation makes an explanation easier to understand. For example, the right shift and left shift operations in the reference implementation are often described using division and multiplication in the text. In general, the text is focused on the "what" and "why" while the symbolic representation most clearly provides the "how". </t> <section anchor="notation" title="Notation and Conventions"> <t> The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 <xref target="rfc2119"></xref>. </t> <t> Various operations in the codec require bit-exact fixed-point behavior, even when writing a floating-point implementation. The notation "Q&lt;n&gt;", where n is an integer, denotes the number of binary digits to the right of the binary point in a fixed-point number. For example, a signed Q14 value in a 16-bit word can represent values from -2.0 to 1.99993896484375, inclusive. This notation is for informational purposes only. Arithmetic, when described, always operates on the underlying integer. For example, the text will explicitly indicate any shifts required after a multiplication.
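As an informal illustration of this notation (a sketch only, not taken from the reference implementation; the function names q14_from_double, q14_mul, and q14_to_double are hypothetical), the following C fragment stores values in Q14 and applies the explicit right shift by 14 bits that a multiplication requires. It assumes that right-shifting a negative integer is an arithmetic shift, which C leaves implementation-defined. <figure> <artwork><![CDATA[
```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only. */

/* Convert a real value to Q14, truncating toward zero.
   The value must fit the signed 16-bit Q14 range. */
int16_t q14_from_double(double x)
{
    assert(x >= -2.0 && x <= 32767.0 / 16384.0);
    return (int16_t)(x * 16384.0);
}

/* Multiply two Q14 values. The raw 32-bit product is in Q28, so it
   is shifted right by 14 bits to return to Q14. The result is kept
   in 32 bits because, e.g., -2.0 * -2.0 does not fit in 16-bit Q14.
   Assumes an arithmetic right shift for negative products. */
int32_t q14_mul(int16_t a, int16_t b)
{
    int32_t product_q28 = (int32_t)a * (int32_t)b;
    return product_q28 >> 14;
}

/* Interpret an integer holding a Q14 value as a real number. */
double q14_to_double(int32_t x)
{
    return (double)x / 16384.0;
}
```
]]></artwork> </figure> With these helpers, 0.5 is stored as the integer 8192, and multiplying 8192 by 8192 produces the integer 4096, which represents 0.25.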
</t> <t> Expressions, where included in the text, follow C operator rules and precedence, with the exception that the syntax "x**y" indicates x raised to the power y. The text also makes use of the following functions: </t> <section anchor="min" toc="exclude" title="min(x,y)"> <t> The smaller of two values x and y. </t> </section> <section anchor="max" toc="exclude" title="max(x,y)"> <t> The larger of two values x and y. </t> </section> <section anchor="clamp" toc="exclude" title="clamp(lo,x,hi)"> <figure align="center"> <artwork align="center"><![CDATA[
clamp(lo,x,hi) = max(lo,min(x,hi))
]]></artwork> </figure> <t> With this definition, if lo &gt; hi, the lower bound is the one that is enforced. </t> </section> <section anchor="sign" toc="exclude" title="sign(x)"> <t> The sign of x, i.e., <figure align="center"> <artwork align="center"><![CDATA[
          ( -1,  x < 0 ,
sign(x) = <  0,  x == 0 ,
          (  1,  x > 0 .
]]></artwork> </figure> </t> </section> <section anchor="abs" toc="exclude" title="abs(x)"> <t> The absolute value of x, i.e., <figure align="center"> <artwork align="center"><![CDATA[
abs(x) = sign(x)*x .
]]></artwork> </figure> </t> </section> <section anchor="floor" toc="exclude" title="floor(f)"> <t> The largest integer z such that z &lt;= f. </t> </section> <section anchor="ceil" toc="exclude" title="ceil(f)"> <t> The smallest integer z such that z &gt;= f. </t> </section> <section anchor="round" toc="exclude" title="round(f)"> <t> The integer z nearest to f, with ties rounded towards negative infinity, i.e., <figure align="center"> <artwork align="center"><![CDATA[
round(f) = ceil(f - 0.5) .
]]></artwork> </figure> </t> </section> <section anchor="log2" toc="exclude" title="log2(f)"> <t> The base-two logarithm of f. </t> </section> <section anchor="ilog" toc="exclude" title="ilog(n)"> <t> The minimum number of bits required to store a positive integer n in binary, or 0 for a non-positive integer n.
<figure align="center"> <artwork align="center"><![CDATA[
          ( 0,                 n <= 0,
ilog(n) = <
          ( floor(log2(n))+1,  n > 0
]]></artwork> </figure> Examples: <list style="symbols"> <t>ilog(-1) = 0</t> <t>ilog(0) = 0</t> <t>ilog(1) = 1</t> <t>ilog(2) = 2</t> <t>ilog(3) = 2</t> <t>ilog(4) = 3</t> <t>ilog(7) = 3</t> </list> </t> </section> </section> </section> <section anchor="overview" title="Opus Codec Overview"> <t> The Opus codec scales from 6 kb/s narrowband mono speech to 510 kb/s fullband stereo music, with algorithmic delays ranging from 5 ms to 65.2 ms. At any given time, either the LP layer, the MDCT layer, or both, may be active. It can seamlessly switch between all of its various operating modes, giving it a great deal of flexibility to adapt to varying content and network conditions without renegotiating the current session. The codec allows input and output of various audio bandwidths, defined as follows: </t> <texttable anchor="audio-bandwidth"> <ttcol>Abbreviation</ttcol> <ttcol align="right">Audio Bandwidth</ttcol> <ttcol align="right">Sample Rate (Effective)</ttcol> <c>NB (narrowband)</c> <c>4 kHz</c> <c>8 kHz</c> <c>MB (medium-band)</c> <c>6 kHz</c> <c>12 kHz</c> <c>WB (wideband)</c> <c>8 kHz</c> <c>16 kHz</c> <c>SWB (super-wideband)</c> <c>12 kHz</c> <c>24 kHz</c> <c>FB (fullband)</c> <c>20 kHz (*)</c> <c>48 kHz</c> </texttable> <t> (*) Although the sampling theorem allows a bandwidth as large as half the sampling rate, Opus never codes audio above 20 kHz, as that is the generally accepted upper limit of human hearing. </t> <t> Opus defines super-wideband (SWB) with an effective sample rate of 24 kHz, unlike some other audio coding standards that use 32 kHz. This was chosen for a number of reasons. The band layout in the MDCT layer naturally allows skipping coefficients for frequencies over 12 kHz, but does not allow cleanly dropping just those frequencies over 16 kHz.
A sample rate of 24 kHz also makes resampling in the MDCT layer easier, as 24 evenly divides 48, and when 24 kHz is sufficient, it can save computation in other processing, such as Acoustic Echo Cancellation (AEC). Experimental changes to the band layout to allow a 16 kHz cutoff (32 kHz effective sample rate) showed potential quality degradations at other sample rates, and at typical bitrates the number of bits saved by using such a cutoff instead of coding in fullband (FB) mode is very small. Therefore, if an application wishes to process a signal sampled at 32 kHz, it should just use FB. </t> <t> The LP layer is based on the SILK codec <xref target="SILK"></xref>. It supports NB, MB, or WB audio and frame sizes from 10 ms to 60 ms, and requires an additional 5 ms look-ahead for noise shaping estimation. A small additional delay (up to 1.5 ms) may be required for sampling rate conversion. Like Vorbis <xref target='Vorbis-website'/> and many other modern codecs, SILK is inherently designed for variable-bitrate (VBR) coding, though the encoder can also produce constant-bitrate (CBR) streams. The version of SILK used in Opus is substantially modified from, and not compatible with, the stand-alone SILK codec previously deployed by Skype. This document does not serve to define that format, but those interested in the original SILK codec should see <xref target="SILK"/> instead. </t> <t> The MDCT layer is based on the CELT codec <xref target="CELT"></xref>. It supports NB, WB, SWB, or FB audio and frame sizes from 2.5 ms to 20 ms, and requires an additional 2.5 ms look-ahead due to the overlapping MDCT windows. The CELT codec is inherently designed for CBR coding, but unlike many CBR codecs it is not limited to a set of predetermined rates. It internally allocates bits to exactly fill any given target budget, and an encoder can produce a VBR stream by varying the target on a per-frame basis. 
The MDCT layer is not used for speech when the audio bandwidth is WB or less, as it is not useful there. On the other hand, non-speech signals are not always adequately coded using linear prediction, so for music, only the MDCT layer should be used. </t> <t> A "Hybrid" mode allows the use of both layers simultaneously with a frame size of 10 or 20 ms and a SWB or FB audio bandwidth. The LP layer codes the low frequencies by resampling the signal down to WB. The MDCT layer follows, coding the high-frequency portion of the signal. The cutoff between the two lies at 8 kHz, the maximum WB audio bandwidth. In the MDCT layer, all bands below 8 kHz are discarded, so there is no coding redundancy between the two layers. </t> <t> The sample rate (in contrast to the actual audio bandwidth) can be chosen independently on the encoder and decoder side; for example, a fullband signal can be decoded as wideband, or vice versa. This approach ensures a sender and receiver can always interoperate, regardless of the capabilities of their actual audio hardware. Internally, the LP layer always operates at a sample rate of twice the audio bandwidth, up to a maximum of 16 kHz, which it continues to use for SWB and FB. The decoder simply resamples its output to support different sample rates. The MDCT layer always operates internally at a sample rate of 48 kHz. Since all the supported sample rates evenly divide this rate, and since the decoder may easily zero out the high-frequency portion of the spectrum in the frequency domain, it can simply decimate the MDCT layer output to achieve the other supported sample rates very cheaply. </t> <t> After conversion to the common, desired output sample rate, the decoder simply adds the output from the two layers together. To compensate for the different look-ahead required by each layer, the CELT encoder input is delayed by an additional 2.7 ms. This ensures that low frequencies and high frequencies arrive at the same time.
This extra delay may be reduced by an encoder by using less look-ahead for noise shaping or using a simpler resampler in the LP layer, but this will reduce quality. However, the base 2.5 ms look-ahead in the CELT layer cannot be reduced in the encoder because it is needed for the MDCT overlap, whose size is fixed by the decoder. </t> <t> Both layers use the same entropy coder, avoiding any waste from "padding bits" between them. The hybrid approach makes it easy to support both CBR and VBR coding. Although the LP layer is VBR, the bit allocation of the MDCT layer can produce a final stream that is CBR by using all the bits left unused by the LP layer. </t> <section title="Control Parameters"> <t> The Opus codec includes a number of control parameters which can be changed dynamically during regular operation of the codec, without interrupting the audio stream from the encoder to the decoder. These parameters only affect the encoder since any impact they have on the bit-stream is signaled in-band such that a decoder can decode any Opus stream without any out-of-band signaling. Any Opus implementation can add or modify these control parameters without affecting interoperability. The most important encoder control parameters in the reference encoder are listed below. </t> <section title="Bitrate" toc="exlcude"> <t> Opus supports all bitrates from 6 kb/s to 510 kb/s. All other parameters being equal, higher bitrate results in higher quality. For a frame size of 20 ms, these are the bitrate "sweet spots" for Opus in various configurations: <list style="symbols"> <t>8-12 kb/s for NB speech,</t> <t>16-20 kb/s for WB speech,</t> <t>28-40 kb/s for FB speech,</t> <t>48-64 kb/s for FB mono music, and</t> <t>64-128 kb/s for FB stereo music.</t> </list> </t> </section> <section title="Number of Channels (Mono/Stereo)" toc="exlcude"> <t> Opus can transmit either mono or stereo frames within a single stream. 
When decoding a mono frame in a stereo decoder, the left and right channels are identical, and when decoding a stereo frame in a mono decoder, the mono output is the average of the left and right channels. In some cases, it is desirable to encode a stereo input stream in mono (e.g., because the bitrate is too low to encode stereo with sufficient quality). The number of channels encoded can be selected in real-time, but by default the reference encoder attempts to make the best decision possible given the current bitrate. </t> </section> <section title="Audio Bandwidth" toc="exlcude"> <t> The audio bandwidths supported by Opus are listed in <xref target="audio-bandwidth"/>. Just like for the number of channels, any decoder can decode audio encoded at any bandwidth. For example, any Opus decoder operating at 8 kHz can decode a FB Opus frame, and any Opus decoder operating at 48 kHz can decode a NB frame. Similarly, the reference encoder can take a 48 kHz input signal and encode it as NB. The higher the audio bandwidth, the higher the required bitrate to achieve acceptable quality. The audio bandwidth can be explicitly specified in real-time, but by default the reference encoder attempts to make the best bandwidth decision possible given the current bitrate. </t> </section> <section title="Frame Duration" toc="exlcude"> <t> Opus can encode frames of 2.5, 5, 10, 20, 40 or 60 ms. It can also combine multiple frames into packets of up to 120 ms. For real-time applications, sending fewer packets per second reduces the bitrate, since it reduces the overhead from IP, UDP, and RTP headers. However, it increases latency and sensitivity to packet losses, as losing one packet constitutes a loss of a bigger chunk of audio. Increasing the frame duration also slightly improves coding efficiency, but the gain becomes small for frame sizes above 20 ms. For this reason, 20 ms frames are a good choice for most applications. 
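To make the header overhead concrete, the following informal C sketch computes the bitrate consumed by headers alone for a given packet duration. The header sizes used (20 bytes for IPv4 without options, 8 bytes for UDP, and 12 bytes for RTP without CSRC entries or header extensions) are assumptions chosen for illustration and are not defined by this document; the function name is hypothetical. <figure> <artwork><![CDATA[
```c
#include <assert.h>

/* Assumed per-packet header sizes in bytes: IPv4 without options,
   UDP, and RTP without CSRC entries or header extensions. These are
   illustrative assumptions, not values defined by Opus. */
enum { IPV4_HEADER = 20, UDP_HEADER = 8, RTP_HEADER = 12 };

/* Header overhead in bits per second when one packet is sent every
   packet_duration_us microseconds. The result is truncated to an
   integer number of bits per second. */
long header_overhead_bps(long packet_duration_us)
{
    long header_bits = 8L * (IPV4_HEADER + UDP_HEADER + RTP_HEADER);
    assert(packet_duration_us > 0);
    return header_bits * 1000000L / packet_duration_us;
}
```
]]></artwork> </figure> Under these assumptions, one packet every 20 ms costs 16 kb/s in headers, while one packet every 2.5 ms costs 128 kb/s, which is why very short packets are expensive at low bitrates.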
</t> </section> <section title="Complexity" toc="exlcude"> <t> There are various aspects of the Opus encoding process where trade-offs can be made between CPU complexity and quality/bitrate. In the reference encoder, the complexity is selected using an integer from 0 to 10, where 0 is the lowest complexity and 10 is the highest. Examples of computations for which such trade-offs may occur are: <list style="symbols"> <t>The order of the pitch analysis whitening filter <xref target="Whitening"/>,</t> <t>The order of the short-term noise shaping filter,</t> <t>The number of states in delayed decision quantization of the residual signal, and</t> <t>The use of certain bit-stream features such as variable time-frequency resolution and the pitch post-filter.</t> </list> </t> </section> <section title="Packet Loss Resilience" toc="exlcude"> <t> Audio codecs often exploit inter-frame correlations to reduce the bitrate at a cost in error propagation: after losing one packet several packets need to be received before the decoder is able to accurately reconstruct the speech signal. The extent to which Opus exploits inter-frame dependencies can be adjusted on the fly to choose a trade-off between bitrate and amount of error propagation. </t> </section> <section title="Forward Error Correction (FEC)" toc="exlcude"> <t> Another mechanism providing robustness against packet loss is the in-band Forward Error Correction (FEC). Packets that are determined to contain perceptually important speech information, such as onsets or transients, are encoded again at a lower bitrate and this re-encoded information is added to a subsequent packet. </t> </section> <section title="Constant/Variable Bitrate" toc="exlcude"> <t> Opus is more efficient when operating with variable bitrate (VBR), which is the default. However, in some (rare) applications, constant bitrate (CBR) is required. 
There are two main reasons to operate in CBR mode: <list style="symbols"> <t>When the transport only supports a fixed size for each compressed frame</t> <t>When encryption is used for an audio stream that is either highly constrained (e.g. yes/no, recorded prompts) or highly sensitive <xref target="SRTP-VBR"></xref> </t> </list> When low-latency transmission is required over a relatively slow connection, then constrained VBR can also be used. This uses VBR in a way that simulates a "bit reservoir" and is equivalent to what MP3 (MPEG 1, Layer 3) and AAC (Advanced Audio Coding) call CBR (i.e., not true CBR due to the bit reservoir). </t> </section> <section title="Discontinuous Transmission (DTX)" toc="exlcude"> <t> Discontinuous Transmission (DTX) reduces the bitrate during silence or background noise. When DTX is enabled, only one frame is encoded every 400 milliseconds. </t> </section> </section> </section> <section anchor="modes" title="Internal Framing"> <t> The Opus encoder produces "packets", which are each a contiguous set of bytes meant to be transmitted as a single unit. The packets described here do not include such things as IP, UDP, or RTP headers which are normally found in a transport-layer packet. A single packet may contain multiple audio frames, so long as they share a common set of parameters, including the operating mode, audio bandwidth, frame size, and channel count (mono vs. stereo). This section describes the possible combinations of these parameters and the internal framing used to pack multiple frames into a single packet. This framing is not self-delimiting. Instead, it assumes that a higher layer (such as UDP or RTP <xref target='RFC3550'/> or Ogg <xref target='RFC3533'/> or Matroska <xref target='Matroska-website'/>) will communicate the length, in bytes, of the packet, and it uses this information to reduce the framing overhead in the packet itself. A decoder implementation MUST support the framing described in this section. 
An alternative, self-delimiting variant of the framing is described in <xref target="self-delimiting-framing"/>. Support for that variant is OPTIONAL. </t> <t> All bit diagrams in this document number the bits so that bit 0 is the most significant bit of the first byte, and bit 7 is the least significant. Bit 8 is thus the most significant bit of the second byte, etc. Well-formed Opus packets obey certain requirements, marked [R1] through [R7] below. These are summarized in <xref target="malformed-packets"/> along with appropriate means of handling malformed packets. </t> <section anchor="toc_byte" title="The TOC Byte"> <t anchor="R1"> A well-formed Opus packet MUST contain at least one byte [R1]. This byte forms a table-of-contents (TOC) header that signals which of the various modes and configurations a given packet uses. It is composed of a configuration number, "config", a stereo flag, "s", and a frame count code, "c", arranged as illustrated in <xref target="toc_byte_fig"/>. A description of each of these fields follows. </t> <figure anchor="toc_byte_fig" title="The TOC Byte"> <artwork align="center"><![CDATA[
 0
 0 1 2 3 4 5 6 7
+-+-+-+-+-+-+-+-+
| config  |s| c |
+-+-+-+-+-+-+-+-+
]]></artwork> </figure> <t> The top five bits of the TOC byte, labeled "config", encode one of 32 possible configurations of operating mode, audio bandwidth, and frame size. As described, the LP (SILK) layer and MDCT (CELT) layer can be combined in three possible operating modes: <list style="numbers"> <t>A SILK-only mode for use in low bitrate connections with an audio bandwidth of WB or less,</t> <t>A Hybrid (SILK+CELT) mode for SWB or FB speech at medium bitrates, and</t> <t>A CELT-only mode for very low delay speech transmission as well as music transmission (NB to FB).</t> </list> The 32 possible configurations each identify which one of these operating modes the packet uses, as well as the audio bandwidth and the frame size.
<xref target="config_bits"/> lists the parameters for each configuration. </t> <texttable anchor="config_bits" title="TOC Byte Configuration Parameters"> <ttcol>Configuration Number(s)</ttcol> <ttcol>Mode</ttcol> <ttcol>Bandwidth</ttcol> <ttcol>Frame Sizes</ttcol> <c>0...3</c> <c>SILK-only</c> <c>NB</c> <c>10, 20, 40, 60 ms</c> <c>4...7</c> <c>SILK-only</c> <c>MB</c> <c>10, 20, 40, 60 ms</c> <c>8...11</c> <c>SILK-only</c> <c>WB</c> <c>10, 20, 40, 60 ms</c> <c>12...13</c> <c>Hybrid</c> <c>SWB</c> <c>10, 20 ms</c> <c>14...15</c> <c>Hybrid</c> <c>FB</c> <c>10, 20 ms</c> <c>16...19</c> <c>CELT-only</c> <c>NB</c> <c>2.5, 5, 10, 20 ms</c> <c>20...23</c> <c>CELT-only</c> <c>WB</c> <c>2.5, 5, 10, 20 ms</c> <c>24...27</c> <c>CELT-only</c> <c>SWB</c> <c>2.5, 5, 10, 20 ms</c> <c>28...31</c> <c>CELT-only</c> <c>FB</c> <c>2.5, 5, 10, 20 ms</c> </texttable> <t> The configuration numbers in each range (e.g., 0...3 for NB SILK-only) correspond to the various choices of frame size, in the same order. For example, configuration 0 has a 10 ms frame size and configuration 3 has a 60 ms frame size. </t> <t> One additional bit, labeled "s", signals mono vs. stereo, with 0 indicating mono and 1 indicating stereo. </t> <t> The remaining two bits of the TOC byte, labeled "c", code the number of frames per packet (codes 0 to 3) as follows: <list style="symbols"> <t>0: 1 frame in the packet</t> <t>1: 2 frames in the packet, each with equal compressed size</t> <t>2: 2 frames in the packet, with different compressed sizes</t> <t>3: an arbitrary number of frames in the packet</t> </list> This draft refers to a packet as a code 0 packet, code 1 packet, etc., based on the value of "c". </t> </section> <section title="Frame Packing"> <t> This section describes how frames are packed according to each possible value of "c" in the TOC byte. 
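Before turning to the individual codes, the TOC byte fields described in <xref target="toc_byte"/> can be extracted as in the following informal C sketch. It is not taken from the reference implementation, and the structure and function names are hypothetical. Frame sizes are returned in tenths of a millisecond so that 2.5 ms can be represented as an integer. <figure> <artwork><![CDATA[
```c
#include <assert.h>

/* Hypothetical structure holding the three TOC byte fields. */
struct toc_fields {
    unsigned config;  /* bits 0..4: configuration number, 0...31 */
    unsigned stereo;  /* bit 5: 0 = mono, 1 = stereo */
    unsigned code;    /* bits 6..7: frame count code, 0...3 */
};

/* Split a TOC byte into its fields. Bit 0 is the most significant
   bit, following the bit numbering used in this document. */
struct toc_fields parse_toc(unsigned char toc)
{
    struct toc_fields f;
    f.config = toc >> 3;
    f.stereo = (toc >> 2) & 0x1;
    f.code   = toc & 0x3;
    return f;
}

/* Frame size, in tenths of a millisecond, for a configuration
   number, following the ranges in the configuration table. */
int frame_size_tenths_ms(unsigned config)
{
    static const int silk[4]   = { 100, 200, 400, 600 };
    static const int hybrid[2] = { 100, 200 };
    static const int celt[4]   = { 25, 50, 100, 200 };
    assert(config < 32);
    if (config < 12) return silk[config & 3];    /* SILK-only */
    if (config < 16) return hybrid[config & 1];  /* Hybrid */
    return celt[config & 3];                     /* CELT-only */
}
```
]]></artwork> </figure> For example, a TOC byte of 0x08 yields configuration 1, mono, and code 0, that is, a single 20 ms NB SILK-only frame.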
</t> <section anchor="frame-length-coding" title="Frame Length Coding"> <t> When a packet contains multiple VBR frames (i.e., code 2 or 3), the compressed length of one or more of these frames is indicated with a one- or two-byte sequence, with the meaning of the first byte as follows: <list style="symbols"> <t>0: No frame (discontinuous transmission (DTX) or lost packet)</t> <t>1...251: Length of the frame in bytes</t> <t>252...255: A second byte is needed. The total length is (second_byte*4)+first_byte</t> </list> </t> <t> The special length 0 indicates that no frame is available, either because it was dropped during transmission by some intermediary or because the encoder chose not to transmit it. Any Opus frame in any mode MAY have a length of 0. </t> <t> The maximum representable length is 255*4+255=1275 bytes. For 20 ms frames, this represents a bitrate of 510 kb/s, which is approximately the highest useful rate for lossily compressed fullband stereo music. Beyond this point, lossless codecs are more appropriate. It is also roughly the maximum useful rate of the MDCT layer, as shortly thereafter quality no longer improves with additional bits due to limitations on the codebook sizes. </t> <t anchor="R2"> No length is transmitted for the last frame in a VBR packet, or for any of the frames in a CBR packet, as it can be inferred from the total size of the packet and the size of all other data in the packet. However, the length of any individual frame MUST NOT exceed 1275 bytes [R2], to allow for repacketization by gateways, conference bridges, or other software. </t> </section> <section title="Code 0: One Frame in the Packet"> <t> For code 0 packets, the TOC byte is immediately followed by N-1 bytes of compressed data for a single frame (where N is the size of the packet), as illustrated in <xref target="code0_packet"/>. 
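For the multi-frame codes that follow, the one- or two-byte length sequence defined in <xref target="frame-length-coding"/> can be decoded as in the following informal C sketch. The function name and return convention are hypothetical and are not those of the reference implementation. <figure> <artwork><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

/* Decode a one- or two-byte frame length from data[0..len-1].
   On success, stores the frame length in *frame_len and returns the
   number of bytes consumed (1 or 2). Returns -1 if the buffer does
   not hold enough bytes to represent a valid length. */
int decode_frame_length(const unsigned char *data, size_t len,
                        int *frame_len)
{
    assert(data != NULL && frame_len != NULL);
    if (len < 1) return -1;
    if (data[0] < 252) {
        /* 0 means no frame (DTX or lost packet); 1...251 is the
           length in bytes. */
        *frame_len = data[0];
        return 1;
    }
    if (len < 2) return -1;  /* a second byte is needed */
    *frame_len = 4 * data[1] + data[0];
    return 2;
}
```
]]></artwork> </figure> With both bytes equal to 255, this yields 255*4+255 = 1275 bytes, the maximum representable length.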
</t> <figure anchor="code0_packet" title="A Code 0 Packet" align="center"> <artwork align="center"><![CDATA[
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| config  |s|0|0|                                               |
+-+-+-+-+-+-+-+-+                                               |
|                    Compressed frame 1 (N-1 bytes)...          :
:                                                               |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]></artwork> </figure> </section> <section title="Code 1: Two Frames in the Packet, Each with Equal Compressed Size"> <t anchor="R3"> For code 1 packets, the TOC byte is immediately followed by the (N-1)/2 bytes of compressed data for the first frame, followed by (N-1)/2 bytes of compressed data for the second frame, as illustrated in <xref target="code1_packet"/>. The number of payload bytes available for compressed data, N-1, MUST be even for all code 1 packets [R3]. </t> <figure anchor="code1_packet" title="A Code 1 Packet" align="center"> <artwork align="center"><![CDATA[
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| config  |s|0|1|                                               |
+-+-+-+-+-+-+-+-+                                               :
|             Compressed frame 1 ((N-1)/2 bytes)...             |
:                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               :
|             Compressed frame 2 ((N-1)/2 bytes)...             |
:                                               +-+-+-+-+-+-+-+-+
|                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]></artwork> </figure> </section> <section title="Code 2: Two Frames in the Packet, with Different Compressed Sizes"> <t anchor="R4"> For code 2 packets, the TOC byte is followed by a one- or two-byte sequence indicating the length of the first frame (marked N1 in <xref target='code2_packet'/>), followed by N1 bytes of compressed data for the first frame. The remaining N-N1-2 or N-N1-3 bytes are the compressed data for the second frame. This is illustrated in <xref target="code2_packet"/>. A code 2 packet MUST contain enough bytes to represent a valid length.
For example, a 1-byte code 2 packet is always invalid, and a 2-byte code 2 packet whose second byte is in the range 252...255 is also invalid. The length of the first frame, N1, MUST also be no larger than the size of the payload remaining after decoding that length for all code 2 packets [R4]. This makes, for example, a 2-byte code 2 packet with a second byte in the range 1...251 invalid as well (the only valid 2-byte code 2 packet is one where the length of both frames is zero). </t> <figure anchor="code2_packet" title="A Code 2 Packet" align="center"> <artwork align="center"><![CDATA[
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| config  |s|1|0| N1 (1-2 bytes):                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               :
|               Compressed frame 1 (N1 bytes)...                |
:                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
|                     Compressed frame 2...                     :
:                                                               |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]></artwork> </figure> </section> <section title="Code 3: A Signaled Number of Frames in the Packet"> <t anchor="R5"> Code 3 packets signal the number of frames, as well as additional padding, called "Opus padding" to indicate that this padding is added at the Opus layer, rather than at the transport layer. Code 3 packets MUST have at least 2 bytes [R6,R7]. The TOC byte is followed by a byte encoding the number of frames in the packet in bits 2 to 7 (marked "M" in <xref target='frame_count_byte'/>), with bit 1 indicating whether or not Opus padding is inserted (marked "p" in <xref target='frame_count_byte'/>), and bit 0 indicating VBR (marked "v" in <xref target='frame_count_byte'/>). M MUST NOT be zero, and the audio duration contained within a packet MUST NOT exceed 120 ms [R5]. This limits the maximum frame count for any frame size to 48 (for 2.5 ms frames), with lower limits for longer frame sizes.
<xref target="frame_count_byte"/> illustrates the layout of the frame count byte. </t> <figure anchor="frame_count_byte" title="The frame count byte"> <artwork align="center"><![CDATA[
 0
 0 1 2 3 4 5 6 7
+-+-+-+-+-+-+-+-+
|v|p|     M     |
+-+-+-+-+-+-+-+-+
]]></artwork> </figure> <t> When Opus padding is used, the number of bytes of padding is encoded in the bytes following the frame count byte. Values from 0...254 indicate that 0...254 bytes of padding are included, in addition to the byte(s) used to indicate the size of the padding. If the value is 255, then the size of the additional padding is 254 bytes, plus the padding value encoded in the next byte. There MUST be at least one more byte in the packet in this case [R6,R7]. The additional padding bytes appear at the end of the packet, and MUST be set to zero by the encoder to avoid creating a covert channel. The decoder MUST accept any value for the padding bytes, however. </t> <t> Although this encoding provides multiple ways to indicate a given number of padding bytes, each uses a different number of bytes to indicate the padding size, and thus will increase the total packet size by a different amount. For example, to add 255 bytes to a packet, set the padding bit, p, to 1, insert a single byte after the frame count byte with a value of 254, and append 254 padding bytes with the value zero to the end of the packet. To add 256 bytes to a packet, set the padding bit to 1, insert two bytes after the frame count byte with the values 255 and 0, respectively, and append 254 padding bytes with the value zero to the end of the packet. By using the value 255 multiple times, it is possible to create a packet of any specific, desired size. Let P be the number of header bytes used to indicate the padding size plus the number of padding bytes themselves (i.e., P is the total number of bytes added to the packet). Then P MUST be no more than N-2 [R6,R7].
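The padding length rules above can be sketched informally in C as follows. This is not the reference implementation; the function name, arguments, and return convention are hypothetical. The input is assumed to begin at the first byte after the frame count byte of a code 3 packet whose padding bit is set, with len equal to N-2. <figure> <artwork><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

/* Parse the padding length bytes of a code 3 packet.
   data points at the first padding length byte and len is the
   number of bytes left in the packet (N-2). On success, returns P,
   the total number of bytes added by padding (length bytes plus
   padding bytes), and stores the number of length bytes in
   *header_bytes. Returns -1 if the packet is too short. */
long parse_opus_padding(const unsigned char *data, size_t len,
                        size_t *header_bytes)
{
    size_t pos = 0;
    long padding = 0;
    unsigned char value;
    assert(data != NULL && header_bytes != NULL);
    do {
        if (pos >= len) return -1;  /* a length byte is missing */
        value = data[pos++];
        /* 255 means 254 bytes plus the value of the next byte. */
        padding += (value == 255) ? 254 : value;
    } while (value == 255);
    if ((size_t)padding + pos > len) return -1;  /* P exceeds N-2 */
    *header_bytes = pos;
    return padding + (long)pos;
}
```
]]></artwork> </figure> For the two examples given above, a single length byte of 254 gives P = 255, and the length bytes 255 and 0 give P = 256.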
</t> <t anchor="R6"> In the CBR case, let R=N-2-P be the number of bytes remaining in the packet after subtracting the (optional) padding. Then the compressed length of each frame in bytes is equal to R/M. The value R MUST be a non-negative integer multiple of M [R6]. The compressed data for all M frames follows, each of size R/M bytes, as illustrated in <xref target="code3cbr_packet"/>. </t>
<figure anchor="code3cbr_packet" title="A CBR Code 3 Packet" align="center">
<artwork align="center"><![CDATA[
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| config  |s|1|1|0|p|     M     |  Padding length (Optional)    :
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
:               Compressed frame 1 (R/M bytes)...               :
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
:               Compressed frame 2 (R/M bytes)...               :
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
:                              ...                              :
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
:               Compressed frame M (R/M bytes)...               :
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
:                  Opus Padding (Optional)...                   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]></artwork>
</figure>
<t anchor="R7"> In the VBR case, the (optional) padding length is followed by M-1 frame lengths (indicated by "N1" to "N[M-1]" in <xref target='code3vbr_packet'/>), each encoded in a one- or two-byte sequence as described above. The packet MUST contain enough data for the M-1 lengths after removing the (optional) padding, and the sum of these lengths MUST be no larger than the number of bytes remaining in the packet after decoding them [R7]. The compressed data for all M frames follows, each frame consisting of the indicated number of bytes, with the final frame consuming any remaining bytes before the final padding, as illustrated in <xref target="code3vbr_packet"/>.
The number of header bytes (TOC byte, frame count byte, padding length bytes, and frame length bytes), plus the signaled length of the first M-1 frames themselves, plus the signaled length of the padding MUST be no larger than N, the total size of the packet. </t>
<figure anchor="code3vbr_packet" title="A VBR Code 3 Packet" align="center">
<artwork align="center"><![CDATA[
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| config  |s|1|1|1|p|     M     |  Padding length (Optional)    :
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
: N1 (1-2 bytes): N2 (1-2 bytes):     ...       :     N[M-1]    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
:               Compressed frame 1 (N1 bytes)...                :
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
:               Compressed frame 2 (N2 bytes)...                :
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
:                              ...                              :
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
:                     Compressed frame M...                     :
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
:                  Opus Padding (Optional)...                   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]></artwork>
</figure>
</section>
</section>
<section anchor="examples" title="Examples">
<t> Simplest case, one NB mono 20 ms SILK frame: </t>
<figure anchor='framing_example_1'>
<artwork><![CDATA[
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|    1    |0|0|0|               compressed data...              :
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]></artwork>
</figure>
<t> Two FB mono 5 ms CELT frames of the same compressed size: </t>
<figure anchor='framing_example_2'>
<artwork><![CDATA[
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   29    |0|0|1|               compressed data...              :
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]></artwork>
</figure>
<t> Two FB mono 20 ms Hybrid frames of different compressed size: </t>
<figure anchor='framing_example_3'>
<artwork><![CDATA[
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   15    |0|1|1|1|0|     2     |      N1       |               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
|                       compressed data...                      :
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]></artwork>
</figure>
<t> Four FB stereo 20 ms CELT frames of the same compressed size: </t>
<figure anchor='framing_example_4'>
<artwork><![CDATA[
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   31    |1|1|1|0|0|     4     |      compressed data...       :
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]></artwork>
</figure>
</section>
<section anchor="malformed-packets" title="Receiving Malformed Packets">
<t> A receiver MUST NOT process packets which violate any of the rules above as normal Opus packets. They are reserved for future applications, such as in-band headers (containing metadata, etc.). Packets which violate these constraints may cause implementations of <spanx style="emph">this</spanx> specification to treat them as malformed, and discard them.
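As a non-normative illustration of such a check, the following C sketch applies the rules for CBR code 3 packets to a packet whose frame count M and padding total P have already been parsed. The function name opus_code3_cbr_frame_size() and its parameters are hypothetical and are not taken from the reference implementation. In particular, the sketch assumes that the caller has derived the duration of one frame, in microseconds, from the TOC byte.
<figure>
<artwork><![CDATA[
#include <stddef.h>

/* Hypothetical check for a CBR code 3 packet.
   n:        total packet size in bytes (N)
   m:        frame count from the frame count byte (M)
   p:        padding length bytes plus padding bytes (P)
   frame_us: duration of one frame in microseconds (assumed to be
             derived from the TOC byte by the caller)
   Returns the size of each frame, R/M, or -1 if malformed. */
static long opus_code3_cbr_frame_size(size_t n, unsigned m, size_t p,
                                      unsigned long frame_us)
{
    size_t r;

    if (n < 2)
        return -1;             /* fewer than two bytes */
    if (m == 0 || frame_us > 120000UL / m)
        return -1;             /* no frames, or more than 120 ms */
    if (p > n - 2)
        return -1;             /* P is larger than N-2 */
    r = n - 2 - p;             /* R = N-2-P */
    if (r % m != 0)
        return -1;             /* R is not a multiple of M */
    if (r / m > 1275)
        return -1;             /* implicit frame length too large */
    return (long)(r / m);
}
]]></artwork>
</figure>
A receiver following this sketch would discard any packet for which the function returns -1 rather than attempting to decode it.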
</t> <t> These constraints are summarized here for reference: <list style="format [R%d]"> <t>Packets are at least one byte.</t> <t>No implicit frame length is larger than 1275 bytes.</t> <t>Code 1 packets have an odd total length, N, so that (N-1)/2 is an integer.</t> <t>Code 2 packets have enough bytes after the TOC for a valid frame length, and that length is no larger than the number of bytes remaining in the packet.</t> <t>Code 3 packets contain at least one frame, but no more than 120 ms of audio total.</t> <t>The length of a CBR code 3 packet, N, is at least two bytes, the number of bytes added to indicate the padding size plus the trailing padding bytes themselves, P, is no more than N-2, and the frame count, M, satisfies the constraint that (N-2-P) is a non-negative integer multiple of M.</t> <t>VBR code 3 packets are large enough to contain all the header bytes (TOC byte, frame count byte, any padding length bytes, and any frame length bytes), plus the length of the first M-1 frames, plus any trailing padding bytes.</t> </list> </t> </section> </section> <section title="Opus Decoder"> <t> The Opus decoder consists of two main blocks: the SILK decoder and the CELT decoder. At any given time, one or both of the SILK and CELT decoders may be active. The output of the Opus decoder is the sum of the outputs from the SILK and CELT decoders with proper sample rate conversion and delay compensation on the SILK side, and optional decimation (when decoding to sample rates less than 48 kHz) on the CELT side, as illustrated in the block diagram below.
</t>
<figure>
<artwork>
<![CDATA[
                         +---------+    +------------+
                         |  SILK   |    |   Sample   |
                      +->| Decoder |--->|    Rate    |----+
Bit-    +---------+   |  |         |    | Conversion |    v
stream  |  Range  |---+  +---------+    +------------+  /---\  Audio
------->| Decoder |                                     | + |------>
        |         |---+  +---------+    +------------+  \---/
        +---------+   |  |  CELT   |    | Decimation |    ^
                      +->| Decoder |--->| (Optional) |----+
                         |         |    |            |
                         +---------+    +------------+
]]>
</artwork>
</figure>
<section anchor="range-decoder" title="Range Decoder">
<t> Opus uses an entropy coder based on range coding <xref target="range-coding"></xref> <xref target="Martin79"></xref>, which is itself a rediscovery of the FIFO arithmetic code introduced by <xref target="coding-thesis"></xref>. It is very similar to arithmetic encoding, except that encoding is done with digits in any base instead of with bits, so it is faster when using larger bases (i.e., a byte). All of the calculations in the range coder must use bit-exact integer arithmetic. </t>
<t> Symbols may also be coded as "raw bits" packed directly into the bitstream, bypassing the range coder. These are packed backwards starting at the end of the frame, as illustrated in <xref target="rawbits-example"/>. This reduces complexity and makes the stream more resilient to bit errors, as corruption in the raw bits will not desynchronize the decoding process, unlike corruption in the input to the range decoder. Raw bits are only used in the CELT layer.
</t>
<figure anchor="rawbits-example" title="Illustrative example of packing range coder and raw bits data">
<artwork align="center"><![CDATA[
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Range coder data (packed MSB to LSB) ->                       :
+                                                               +
:                                                               :
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
:     | <- Boundary occurs at an arbitrary bit position         :
+-+-+-+                                                         +
:                          <- Raw bits data (packed LSB to MSB) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]></artwork>
</figure>
<t> Each symbol coded by the range coder is drawn from a finite alphabet and coded in a separate "context", which describes the size of the alphabet and the relative frequency of each symbol in that alphabet. </t>
<t> Suppose there is a context with n symbols, identified with an index that ranges from 0 to n-1. The parameters needed to encode or decode symbol k in this context are represented by a three-tuple (fl[k], fh[k], ft), with 0 &lt;= fl[k] &lt; fh[k] &lt;= ft &lt;= 65535. The values of this tuple are derived from the probability model for the symbol, represented by traditional "frequency counts". Because Opus uses static contexts these are not updated as symbols are decoded. Let f[i] be the frequency of symbol i. Then the three-tuple corresponding to symbol k is given by </t>
<figure align="center">
<artwork align="center"><![CDATA[
        k-1                                   n-1
        __                                    __
fl[k] = \  f[i],  fh[k] = fl[k] + f[k],  ft = \  f[i]
        /_                                    /_
        i=0                                   i=0
]]></artwork>
</figure>
<t> The range decoder extracts the symbols and integers encoded using the range encoder in <xref target="range-encoder"/>. The range decoder maintains an internal state vector composed of the two-tuple (val, rng), representing the difference between the high end of the current range and the actual coded value, minus one, and the size of the current range, respectively. Both val and rng are 32-bit unsigned integer values.
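As a non-normative illustration of the three-tuple defined earlier in this section, the following C sketch derives (fl[k], fh[k], ft) from a table of frequency counts. The type sym_tuple and the function sym_tuple_from_freqs() are hypothetical names introduced only for this example; the reference implementation does not necessarily store its contexts as raw frequency counts.
<figure>
<artwork><![CDATA[
#include <stdint.h>

/* Hypothetical container for the (fl[k], fh[k], ft) three-tuple. */
typedef struct {
    uint32_t fl;
    uint32_t fh;
    uint32_t ft;
} sym_tuple;

/* f: frequency counts f[0..n-1] of a static context
   k: index of the symbol of interest
   Returns 0 on success, or -1 if the counts cannot satisfy
   0 <= fl[k] < fh[k] <= ft <= 65535. */
static int sym_tuple_from_freqs(const uint16_t *f, int n, int k,
                                sym_tuple *out)
{
    uint32_t fl = 0;
    uint32_t ft = 0;
    int i;

    if (k < 0 || k >= n || f[k] == 0)
        return -1;             /* fl[k] < fh[k] needs f[k] > 0 */
    for (i = 0; i < n; i++) {
        if (i < k)
            fl += f[i];        /* sum of f[i] for i = 0..k-1 */
        ft += f[i];            /* sum of f[i] for i = 0..n-1 */
        if (ft > 65535)
            return -1;         /* ft must fit in 16 bits */
    }
    out->fl = fl;
    out->fh = fl + f[k];
    out->ft = ft;
    return 0;
}
]]></artwork>
</figure>
For example, with the assumed counts f = {1, 2, 5}, symbol 1 has the three-tuple (1, 3, 8).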
</t> <section anchor="range-decoder-init" title="Range Decoder Initialization"> <t> Let b0 be the first input byte (or zero if there are no bytes in this Opus frame). The decoder initializes rng to 128 and initializes val to (127 - (b0&gt;&gt;1)), where (b0&gt;&gt;1) is the top 7 bits of the first input byte. It saves the remaining bit, (b0&amp;1), for use in the renormalization procedure described in <xref target="range-decoder-renorm"/>, which the decoder invokes immediately after initialization to read additional bits and establish the invariant that rng &gt; 2**23. </t> </section> <section anchor="decoding-symbols" title="Decoding Symbols"> <t> Decoding a symbol is a two-step process. The first step determines a 16-bit unsigned value fs, which lies within the range of some symbol in the current context. The second step updates the range decoder state with th