<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE rfc SYSTEM 'rfc2629.dtd'>
<?rfc toc="yes" symrefs="yes" ?>
<rfc ipr="trust200902" category="std" docName="draft-ietf-codec-opus-14">
<front>
<title abbrev="Interactive Audio Codec">Definition of the Opus Audio Codec</title>
<author initials="JM" surname="Valin" fullname="Jean-Marc Valin">
<organization>Mozilla Corporation</organization>
<address>
<postal>
<street>650 Castro Street</street>
<city>Mountain View</city>
<region>CA</region>
<code>94041</code>
<country>USA</country>
</postal>
<phone>+1 650 903-0800</phone>
<email>jmvalin@jmvalin.ca</email>
</address>
</author>
<author initials="K." surname="Vos" fullname="Koen Vos">
<organization>Skype Technologies S.A.</organization>
<address>
<postal>
<street>Soder Malarstrand 43</street>
<city>Stockholm</city>
<region></region>
<code>11825</code>
<country>SE</country>
</postal>
<phone>+46 73 085 7619</phone>
<email>koen.vos@skype.net</email>
</address>
</author>
<author initials="T." surname="Terriberry" fullname="Timothy B. Terriberry">
<organization>Mozilla Corporation</organization>
<address>
<postal>
<street>650 Castro Street</street>
<city>Mountain View</city>
<region>CA</region>
<code>94041</code>
<country>USA</country>
</postal>
<phone>+1 650 903-0800</phone>
<email>tterriberry@mozilla.com</email>
</address>
</author>
<date day="17" month="May" year="2012" />
<area>General</area>
<workgroup></workgroup>
<abstract>
<t>
This document defines the Opus interactive speech and audio codec.
Opus is designed to handle a wide range of interactive audio applications,
including Voice over IP, videoconferencing, in-game chat, and even live,
distributed music performances.
It scales from low bitrate narrowband speech at 6 kb/s to very high quality
stereo music at 510 kb/s.
Opus uses both linear prediction (LP) and the Modified Discrete Cosine
Transform (MDCT) to achieve good compression of both speech and music.
</t>
</abstract>
</front>
<middle>
<section anchor="introduction" title="Introduction">
<t>
The Opus codec is a real-time interactive audio codec designed to meet the requirements
described in <xref target="requirements"></xref>.
It is composed of a linear
prediction (LP)-based <xref target="LPC"/> layer and a Modified Discrete Cosine Transform
(MDCT)-based <xref target="MDCT"/> layer.
The main idea behind using two layers is that in speech, linear prediction
techniques (such as Code-Excited Linear Prediction, or CELP) code low frequencies more efficiently than transform
(e.g., MDCT) domain techniques, while the situation is reversed for music and
higher speech frequencies.
Thus a codec with both layers available can operate over a wider range than
either one alone and, by combining them, achieve better quality than either
one individually.
</t>
<t>
The primary normative part of this specification is provided by the source code
in <xref target="ref-implementation"></xref>.
Only the decoder portion of this software is normative, though a
significant amount of code is shared by both the encoder and decoder.
<xref target="conformance"/> provides a decoder conformance test.
The decoder contains a great deal of integer and fixed-point arithmetic which
needs to be performed exactly, including all rounding considerations, so any
useful specification requires domain-specific symbolic language to adequately
define these operations.
Additionally, any
conflict between the symbolic representation and the included reference
implementation must be resolved. For the practical reasons of compatibility and
testability it would be advantageous to give the reference implementation
priority in any disagreement. The C language is also one of the most
widely understood human-readable symbolic representations for machine
behavior.
For these reasons this RFC uses the reference implementation as the sole
symbolic representation of the codec.
</t>
<t>While the symbolic representation is unambiguous and complete, it is not
always the easiest way to understand the codec's operation. For this reason
this document also describes significant parts of the codec in English and
takes the opportunity to explain the rationale behind many of the more
surprising elements of the design. These descriptions are intended to be
accurate and informative, but the limitations of common English sometimes
result in ambiguity, so it is expected that the reader will always read
them alongside the symbolic representation. Numerous references to the
implementation are provided for this purpose. The descriptions sometimes
differ from the reference in ordering or through mathematical simplification
wherever such deviation makes an explanation easier to understand.
For example, the right shift and left shift operations in the reference
implementation are often described using division and multiplication in the text.
In general, the text is focused on the "what" and "why" while the symbolic
representation most clearly provides the "how".
</t>
<section anchor="notation" title="Notation and Conventions">
<t>
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD",
"SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be
interpreted as described in RFC 2119 <xref target="rfc2119"></xref>.
</t>
<t>
Various operations in the codec require bit-exact fixed-point behavior, even
when writing a floating-point implementation.
The notation "Q&lt;n&gt;", where n is an integer, denotes the number of binary
digits to the right of the binary point in a fixed-point number.
For example, a signed Q14 value in a 16-bit word can represent values from
-2.0 to 1.99993896484375, inclusive.
This notation is for informational purposes only.
Arithmetic, when described, always operates on the underlying integer.
E.g., the text will explicitly indicate any shifts required after a
multiplication.
</t>
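<t>
As an illustration of the Q notation, the following C sketch converts between a
real value and its signed Q14 representation in a 16-bit word, and multiplies
two Q14 values with the explicit shift afterwards.
The helper names (q14_from_double, q14_to_double, q14_mul) are hypothetical and
are not taken from the reference implementation; the rounding and saturation
used in the conversion are arbitrary choices made for this example only.
</t>
<figure>
<artwork><![CDATA[
```c
#include <stdint.h>

/* Hypothetical helpers illustrating Q14 in a 16-bit word. */
#define Q14_SHIFT 14

/* Convert a real value to Q14, rounding to nearest and saturating
 * at the limits of a signed 16-bit word (assumption of this sketch). */
int16_t q14_from_double(double x)
{
    double scaled = x * (1 << Q14_SHIFT);
    if (scaled >= 32767.0) return INT16_MAX;
    if (scaled <= -32768.0) return INT16_MIN;
    return (int16_t)(scaled < 0 ? scaled - 0.5 : scaled + 0.5);
}

/* Interpret the underlying integer as a real value. */
double q14_to_double(int16_t q)
{
    return (double)q / (1 << Q14_SHIFT);
}

/* The product of two Q14 integers is a Q28 integer; shifting right
 * by 14 returns it to Q14. The caller must ensure the result fits.
 * Right-shifting a negative value is implementation-defined in C;
 * this sketch assumes an arithmetic shift. */
int16_t q14_mul(int16_t a, int16_t b)
{
    int32_t prod = (int32_t)a * (int32_t)b;
    return (int16_t)(prod >> Q14_SHIFT);
}
```
]]></artwork>
</figure>
<t>
In this sketch the largest 16-bit value, 32767, maps to 1.99993896484375 and
the smallest, -32768, maps to -2.0, matching the range given above.
</t>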
<t>
Expressions, where included in the text, follow C operator rules and
precedence, with the exception that the syntax "x**y" indicates x raised to
the power y.
The text also makes use of the following functions:
</t>
<section anchor="min" toc="exclude" title="min(x,y)">
<t>
The smaller of the two values x and y.
</t>
</section>
<section anchor="max" toc="exclude" title="max(x,y)">
<t>
The larger of the two values x and y.
</t>
</section>
<section anchor="clamp" toc="exclude" title="clamp(lo,x,hi)">
<figure align="center">
<artwork align="center"><![CDATA[
clamp(lo,x,hi) = max(lo,min(x,hi))
]]></artwork>
</figure>
<t>
With this definition, if lo > hi, the lower bound is the one that
is enforced.
</t>
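<t>
A minimal C sketch of min, max, and clamp for plain integers follows.
The function names (min_int, max_int, clamp_int) are hypothetical; they only
mirror the definitions in the text.
</t>
<figure>
<artwork><![CDATA[
```c
/* Hypothetical integer versions of min(), max(), and clamp(). */
int min_int(int x, int y) { return x < y ? x : y; }
int max_int(int x, int y) { return x > y ? x : y; }

/* clamp(lo,x,hi) = max(lo,min(x,hi)); because max() is applied
 * last, the lower bound wins whenever lo > hi. */
int clamp_int(int lo, int x, int hi)
{
    return max_int(lo, min_int(x, hi));
}
```
]]></artwork>
</figure>
<t>
With these definitions, clamp_int(10, 5, 0) evaluates to 10: the inner min
yields 0, and the outer max then enforces the lower bound.
</t>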
</section>
<section anchor="sign" toc="exclude" title="sign(x)">
<t>
The sign of x, i.e.,
<figure align="center">
<artwork align="center"><![CDATA[
( -1, x < 0 ,
sign(x) = < 0, x == 0 ,
( 1, x > 0 .
]]></artwork>
</figure>
</t>
</section>
<section anchor="abs" toc="exclude" title="abs(x)">
<t>
The absolute value of x, i.e.,
<figure align="center">
<artwork align="center"><![CDATA[
abs(x) = sign(x)*x .
]]></artwork>
</figure>
</t>
</section>
<section anchor="floor" toc="exclude" title="floor(f)">
<t>
The largest integer z such that z &lt;= f.
</t>
</section>
<section anchor="ceil" toc="exclude" title="ceil(f)">
<t>
The smallest integer z such that z >= f.
</t>
</section>
<section anchor="round" toc="exclude" title="round(f)">
<t>
The integer z nearest to f, with ties rounded towards negative infinity,
i.e.,
<figure align="center">
<artwork align="center"><![CDATA[
round(f) = ceil(f - 0.5) .
]]></artwork>
</figure>
</t>
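<t>
This tie-breaking rule differs from that of the standard C round() function,
which rounds halfway cases away from zero.
The sketch below implements the definition above without the math library.
The names ceil_long and round_ties_down are hypothetical, and the sketch
assumes that the argument fits in a long.
</t>
<figure>
<artwork><![CDATA[
```c
/* Hypothetical helper: ceil(f) for values that fit in a long.
 * The cast truncates toward zero, so a positive fractional part
 * needs one more step up. */
long ceil_long(double f)
{
    long z = (long)f;
    return (z < f) ? z + 1 : z;
}

/* round(f) = ceil(f - 0.5): ties go towards negative infinity. */
long round_ties_down(double f)
{
    return ceil_long(f - 0.5);
}
```
]]></artwork>
</figure>
<t>
For example, this yields 0 for 0.5, 2 for 2.5, and -1 for -0.5.
</t>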
</section>
<section anchor="log2" toc="exclude" title="log2(f)">
<t>
The base-two logarithm of f.
</t>
</section>
<section anchor="ilog" toc="exclude" title="ilog(n)">
<t>
The minimum number of bits required to store a positive integer n in binary,
or 0 for a non-positive integer n.
<figure align="center">
<artwork align="center"><![CDATA[
( 0, n <= 0,
ilog(n) = <
( floor(log2(n))+1, n > 0
]]></artwork>
</figure>
Examples:
<list style="symbols">
<t>ilog(-1) = 0</t>
<t>ilog(0) = 0</t>
<t>ilog(1) = 1</t>
<t>ilog(2) = 2</t>
<t>ilog(3) = 2</t>
<t>ilog(4) = 3</t>
<t>ilog(7) = 3</t>
</list>
</t>
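<t>
A straightforward way to compute this function is to count how many times the
value can be shifted right before it reaches zero.
The following C sketch does so; it is written for clarity rather than speed
and is not the routine used by the reference implementation.
</t>
<figure>
<artwork><![CDATA[
```c
#include <stdint.h>

/* Sketch of ilog(n): number of bits needed to store a positive n,
 * or 0 when n is not positive. */
int ilog(int32_t n)
{
    int bits = 0;
    if (n <= 0) return 0;
    while (n > 0) {
        bits++;
        n >>= 1;
    }
    return bits;
}
```
]]></artwork>
</figure>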
</section>
</section>
</section>
<section anchor="overview" title="Opus Codec Overview">
<t>
The Opus codec scales from 6 kb/s narrowband mono speech to 510 kb/s
fullband stereo music, with algorithmic delays ranging from 5 ms to
65.2 ms.
At any given time, either the LP layer, the MDCT layer, or both, may be active.
It can seamlessly switch between all of its various operating modes, giving it
a great deal of flexibility to adapt to varying content and network
conditions without renegotiating the current session.
The codec allows input and output of various audio bandwidths, defined as
follows:
</t>
<texttable anchor="audio-bandwidth">
<ttcol>Abbreviation</ttcol>
<ttcol align="right">Audio Bandwidth</ttcol>
<ttcol align="right">Sample Rate (Effective)</ttcol>
<c>NB (narrowband)</c> <c>4 kHz</c> <c>8 kHz</c>
<c>MB (medium-band)</c> <c>6 kHz</c> <c>12 kHz</c>
<c>WB (wideband)</c> <c>8 kHz</c> <c>16 kHz</c>
<c>SWB (super-wideband)</c> <c>12 kHz</c> <c>24 kHz</c>
<c>FB (fullband)</c> <c>20 kHz (*)</c> <c>48 kHz</c>
</texttable>
<t>
(*) Although the sampling theorem allows a bandwidth as large as half the
sampling rate, Opus never codes audio above 20 kHz, as that is the
generally accepted upper limit of human hearing.
</t>
<t>
Opus defines super-wideband (SWB) with an effective sample rate of 24 kHz,
unlike some other audio coding standards that use 32 kHz.
This was chosen for a number of reasons.
The band layout in the MDCT layer naturally allows skipping coefficients for
frequencies over 12 kHz, but does not allow cleanly dropping just those
frequencies over 16 kHz.
A sample rate of 24 kHz also makes resampling in the MDCT layer easier,
as 24 evenly divides 48, and when 24 kHz is sufficient, it can save
computation in other processing, such as Acoustic Echo Cancellation (AEC).
Experimental changes to the band layout to allow a 16 kHz cutoff
(32 kHz effective sample rate) showed potential quality degradations at
other sample rates, and at typical bitrates the number of bits saved by using
such a cutoff instead of coding in fullband (FB) mode is very small.
Therefore, if an application wishes to process a signal sampled at 32 kHz,
it should just use FB.
</t>
<t>
The LP layer is based on the SILK codec
<xref target="SILK"></xref>.
It supports NB, MB, or WB audio and frame sizes from 10 ms to 60 ms,
and requires an additional 5 ms look-ahead for noise shaping estimation.
A small additional delay (up to 1.5 ms) may be required for sampling rate
conversion.
Like Vorbis <xref target='Vorbis-website'/> and many other modern codecs, SILK is inherently designed for
variable-bitrate (VBR) coding, though the encoder can also produce
constant-bitrate (CBR) streams.
The version of SILK used in Opus is substantially modified from, and not
compatible with, the stand-alone SILK codec previously deployed by Skype.
This document does not serve to define that format, but those interested in the
original SILK codec should see <xref target="SILK"/> instead.
</t>
<t>
The MDCT layer is based on the CELT codec <xref target="CELT"></xref>.
It supports NB, WB, SWB, or FB audio and frame sizes from 2.5 ms to
20 ms, and requires an additional 2.5 ms look-ahead due to the
overlapping MDCT windows.
The CELT codec is inherently designed for CBR coding, but unlike many CBR
codecs it is not limited to a set of predetermined rates.
It internally allocates bits to exactly fill any given target budget, and an
encoder can produce a VBR stream by varying the target on a per-frame basis.
The MDCT layer is not used for speech when the audio bandwidth is WB or less,
as it is not useful there.
On the other hand, non-speech signals are not always adequately coded using
linear prediction, so for music only the MDCT layer should be used.
</t>
<t>
A "Hybrid" mode allows the use of both layers simultaneously with a frame size
of 10 or 20 ms and a SWB or FB audio bandwidth.
The LP layer codes the low frequencies by resampling the signal down to WB.
The MDCT layer follows, coding the high frequency portion of the signal.
The cutoff between the two lies at 8 kHz, the maximum WB audio bandwidth.
In the MDCT layer, all bands below 8 kHz are discarded, so there is no
coding redundancy between the two layers.
</t>
<t>
The sample rate (in contrast to the actual audio bandwidth) can be chosen
independently on the encoder and decoder side, e.g., a fullband signal can be
decoded as wideband, or vice versa.
This approach ensures a sender and receiver can always interoperate, regardless
of the capabilities of their actual audio hardware.
Internally, the LP layer always operates at a sample rate of twice the audio
bandwidth, up to a maximum of 16 kHz, which it continues to use for SWB
and FB.
The decoder simply resamples its output to support different sample rates.
The MDCT layer always operates internally at a sample rate of 48 kHz.
Since all the supported sample rates evenly divide this rate, and since the
decoder may easily zero out the high frequency portion of the spectrum in
the frequency domain, it can simply decimate the MDCT layer output to achieve
the other supported sample rates very cheaply.
</t>
<t>
After conversion to the common, desired output sample rate, the decoder simply
adds the output from the two layers together.
To compensate for the different look-ahead required by each layer, the CELT
encoder input is delayed by an additional 2.7 ms.
This ensures that low frequencies and high frequencies arrive at the same time.
This extra delay may be reduced by an encoder by using less look-ahead for noise
shaping or using a simpler resampler in the LP layer, but this will reduce
quality.
However, the base 2.5 ms look-ahead in the CELT layer cannot be reduced in
the encoder because it is needed for the MDCT overlap, whose size is fixed by
the decoder.
</t>
<t>
Both layers use the same entropy coder, avoiding any waste from "padding bits"
between them.
The hybrid approach makes it easy to support both CBR and VBR coding.
Although the LP layer is VBR, the bit allocation of the MDCT layer can produce
a final stream that is CBR by using all the bits left unused by the LP layer.
</t>
<section title="Control Parameters">
<t>
The Opus codec includes a number of control parameters which can be changed dynamically during
regular operation of the codec, without interrupting the audio stream from the encoder to the decoder.
These parameters only affect the encoder since any impact they have on the bit-stream is signaled
in-band such that a decoder can decode any Opus stream without any out-of-band signaling. Any Opus
implementation can add or modify these control parameters without affecting interoperability. The most
important encoder control parameters in the reference encoder are listed below.
</t>
<section title="Bitrate" toc="exclude">
<t>
Opus supports all bitrates from 6 kb/s to 510 kb/s. All other parameters being
equal, higher bitrate results in higher quality. For a frame size of 20 ms, these
are the bitrate "sweet spots" for Opus in various configurations:
<list style="symbols">
<t>8-12 kb/s for NB speech,</t>
<t>16-20 kb/s for WB speech,</t>
<t>28-40 kb/s for FB speech,</t>
<t>48-64 kb/s for FB mono music, and</t>
<t>64-128 kb/s for FB stereo music.</t>
</list>
</t>
</section>
<section title="Number of Channels (Mono/Stereo)" toc="exclude">
<t>
Opus can transmit either mono or stereo frames within a single stream.
When decoding a mono frame in a stereo decoder, the left and right channels are
identical, and when decoding a stereo frame in a mono decoder, the mono output
is the average of the left and right channels.
In some cases, it is desirable to encode a stereo input stream in mono (e.g.,
because the bitrate is too low to encode stereo with sufficient quality).
The number of channels encoded can be selected in real-time, but by default the
reference encoder attempts to make the best decision possible given the
current bitrate.
</t>
</section>
<section title="Audio Bandwidth" toc="exclude">
<t>
The audio bandwidths supported by Opus are listed in
<xref target="audio-bandwidth"/>.
As with the number of channels, any decoder can decode audio encoded at
any bandwidth.
For example, any Opus decoder operating at 8 kHz can decode a FB Opus
frame, and any Opus decoder operating at 48 kHz can decode a NB frame.
Similarly, the reference encoder can take a 48 kHz input signal and
encode it as NB.
The higher the audio bandwidth, the higher the required bitrate to achieve
acceptable quality.
The audio bandwidth can be explicitly specified in real-time, but by default
the reference encoder attempts to make the best bandwidth decision possible
given the current bitrate.
</t>
</section>
<section title="Frame Duration" toc="exclude">
<t>
Opus can encode frames of 2.5, 5, 10, 20, 40 or 60 ms.
It can also combine multiple frames into packets of up to 120 ms.
For real-time applications, sending fewer packets per second reduces the
bitrate, since it reduces the overhead from IP, UDP, and RTP headers.
However, it increases latency and sensitivity to packet losses, as losing one
packet constitutes a loss of a bigger chunk of audio.
Increasing the frame duration also slightly improves coding efficiency, but the
gain becomes small for frame sizes above 20 ms.
For this reason, 20 ms frames are a good choice for most applications.
</t>
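<t>
To make the header overhead concrete, the following C sketch computes the
bitrate consumed by headers alone for a given packet duration.
It assumes minimal IPv4 (20 bytes), UDP (8 bytes), and RTP (12 bytes) headers
with no options or extensions; these sizes, and the function name
header_overhead_bps, are assumptions of this example and not part of Opus.
</t>
<figure>
<artwork><![CDATA[
```c
/* Assumed minimal header sizes, in bytes (no options/extensions). */
#define IPV4_HEADER_BYTES 20
#define UDP_HEADER_BYTES   8
#define RTP_HEADER_BYTES  12

/* Header overhead in bits per second when one packet is sent every
 * packet_duration_us microseconds (integer division truncates). */
long header_overhead_bps(long packet_duration_us)
{
    long header_bits = 8L * (IPV4_HEADER_BYTES + UDP_HEADER_BYTES
                             + RTP_HEADER_BYTES);
    return header_bits * 1000000L / packet_duration_us;
}
```
]]></artwork>
</figure>
<t>
Under these assumptions, 20 ms packets cost 16 kb/s in headers, whereas 2.5 ms
packets cost 128 kb/s, which is more than the audio payload at most of the
bitrates listed above.
</t>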
</section>
<section title="Complexity" toc="exclude">
<t>
There are various aspects of the Opus encoding process where trade-offs
can be made between CPU complexity and quality/bitrate. In the reference
encoder, the complexity is selected using an integer from 0 to 10, where
0 is the lowest complexity and 10 is the highest. Examples of
computations for which such trade-offs may occur are:
<list style="symbols">
<t>The order of the pitch analysis whitening filter <xref target="Whitening"/>,</t>
<t>The order of the short-term noise shaping filter,</t>
<t>The number of states in delayed decision quantization of the
residual signal, and</t>
<t>The use of certain bit-stream features such as variable time-frequency
resolution and the pitch post-filter.</t>
</list>
</t>
</section>
<section title="Packet Loss Resilience" toc="exclude">
<t>
Audio codecs often exploit inter-frame correlations to reduce the
bitrate at a cost in error propagation: after losing one packet
several packets need to be received before the decoder is able to
accurately reconstruct the speech signal. The extent to which Opus
exploits inter-frame dependencies can be adjusted on the fly to
choose a trade-off between bitrate and amount of error propagation.
</t>
</section>
<section title="Forward Error Correction (FEC)" toc="exclude">
<t>
Another mechanism providing robustness against packet loss is the in-band
Forward Error Correction (FEC). Packets that are determined to
contain perceptually important speech information, such as onsets or
transients, are encoded again at a lower bitrate and this re-encoded
information is added to a subsequent packet.
</t>
</section>
<section title="Constant/Variable Bitrate" toc="exclude">
<t>
Opus is more efficient when operating with variable bitrate (VBR), which is
the default. However, in some (rare) applications, constant bitrate (CBR)
is required. There are two main reasons to operate in CBR mode:
<list style="symbols">
<t>When the transport only supports a fixed size for each compressed frame</t>
<t>When encryption is used for an audio stream that is either highly constrained
(e.g. yes/no, recorded prompts) or highly sensitive <xref target="SRTP-VBR"></xref> </t>
</list>
When low-latency transmission is required over a relatively slow connection, then
constrained VBR can also be used. This uses VBR in a way that simulates a
"bit reservoir" and is equivalent to what MP3 (MPEG 1, Layer 3) and
AAC (Advanced Audio Coding) call CBR (i.e., not true
CBR due to the bit reservoir).
</t>
</section>
<section title="Discontinuous Transmission (DTX)" toc="exclude">
<t>
Discontinuous Transmission (DTX) reduces the bitrate during silence
or background noise. When DTX is enabled, only one frame is encoded
every 400 milliseconds.
</t>
</section>
</section>
</section>
<section anchor="modes" title="Internal Framing">
<t>
The Opus encoder produces "packets", which are each a contiguous set of bytes
meant to be transmitted as a single unit.
The packets described here do not include such things as IP, UDP, or RTP
headers which are normally found in a transport-layer packet.
A single packet may contain multiple audio frames, so long as they share a
common set of parameters, including the operating mode, audio bandwidth, frame
size, and channel count (mono vs. stereo).
This section describes the possible combinations of these parameters and the
internal framing used to pack multiple frames into a single packet.
This framing is not self-delimiting.
Instead, it assumes that a higher layer (such as UDP or RTP <xref target='RFC3550'/>
or Ogg <xref target='RFC3533'/> or Matroska <xref target='Matroska-website'/>)
will communicate the length, in bytes, of the packet, and it uses this
information to reduce the framing overhead in the packet itself.
A decoder implementation MUST support the framing described in this section.
An alternative, self-delimiting variant of the framing is described in
<xref target="self-delimiting-framing"/>.
Support for that variant is OPTIONAL.
</t>
<t>
All bit diagrams in this document number the bits so that bit 0 is the most
significant bit of the first byte, and bit 7 is the least significant.
Bit 8 is thus the most significant bit of the second byte, etc.
Well-formed Opus packets obey certain requirements, marked [R1] through [R7]
below.
These are summarized in <xref target="malformed-packets"/> along with
appropriate means of handling malformed packets.
</t>
<section anchor="toc_byte" title="The TOC Byte">
<t anchor="R1">
A well-formed Opus packet MUST contain at least one byte [R1].
This byte forms a table-of-contents (TOC) header that signals which of the
various modes and configurations a given packet uses.
It is composed of a configuration number, "config", a stereo flag, "s", and a
frame count code, "c", arranged as illustrated in
<xref target="toc_byte_fig"/>.
A description of each of these fields follows.
</t>
<figure anchor="toc_byte_fig" title="The TOC Byte">
<artwork align="center"><![CDATA[
0
0 1 2 3 4 5 6 7
+-+-+-+-+-+-+-+-+
| config |s| c |
+-+-+-+-+-+-+-+-+
]]></artwork>
</figure>
<t>
The top five bits of the TOC byte, labeled "config", encode one of 32 possible
configurations of operating mode, audio bandwidth, and frame size.
As described, the LP (SILK) layer and MDCT (CELT) layer can be combined in three possible
operating modes:
<list style="numbers">
<t>A SILK-only mode for use in low bitrate connections with an audio bandwidth
of WB or less,</t>
<t>A Hybrid (SILK+CELT) mode for SWB or FB speech at medium bitrates, and</t>
<t>A CELT-only mode for very low delay speech transmission as well as music
transmission (NB to FB).</t>
</list>
The 32 possible configurations each identify which one of these operating modes
the packet uses, as well as the audio bandwidth and the frame size.
<xref target="config_bits"/> lists the parameters for each configuration.
</t>
<texttable anchor="config_bits" title="TOC Byte Configuration Parameters">
<ttcol>Configuration Number(s)</ttcol>
<ttcol>Mode</ttcol>
<ttcol>Bandwidth</ttcol>
<ttcol>Frame Sizes</ttcol>
<c>0...3</c> <c>SILK-only</c> <c>NB</c> <c>10, 20, 40, 60 ms</c>
<c>4...7</c> <c>SILK-only</c> <c>MB</c> <c>10, 20, 40, 60 ms</c>
<c>8...11</c> <c>SILK-only</c> <c>WB</c> <c>10, 20, 40, 60 ms</c>
<c>12...13</c> <c>Hybrid</c> <c>SWB</c> <c>10, 20 ms</c>
<c>14...15</c> <c>Hybrid</c> <c>FB</c> <c>10, 20 ms</c>
<c>16...19</c> <c>CELT-only</c> <c>NB</c> <c>2.5, 5, 10, 20 ms</c>
<c>20...23</c> <c>CELT-only</c> <c>WB</c> <c>2.5, 5, 10, 20 ms</c>
<c>24...27</c> <c>CELT-only</c> <c>SWB</c> <c>2.5, 5, 10, 20 ms</c>
<c>28...31</c> <c>CELT-only</c> <c>FB</c> <c>2.5, 5, 10, 20 ms</c>
</texttable>
<t>
The configuration numbers in each range (e.g., 0...3 for NB SILK-only)
correspond to the various choices of frame size, in the same order.
For example, configuration 0 has a 10 ms frame size and configuration 3
has a 60 ms frame size.
</t>
<t>
One additional bit, labeled "s", signals mono vs. stereo, with 0 indicating
mono and 1 indicating stereo.
</t>
<t>
The remaining two bits of the TOC byte, labeled "c", code the number of frames
per packet (codes 0 to 3) as follows:
<list style="symbols">
<t>0: 1 frame in the packet</t>
<t>1: 2 frames in the packet, each with equal compressed size</t>
<t>2: 2 frames in the packet, with different compressed sizes</t>
<t>3: an arbitrary number of frames in the packet</t>
</list>
This draft refers to a packet as a code 0 packet, code 1 packet, etc., based on
the value of "c".
</t>
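<t>
The field layout and the configuration table above can be sketched in C as
follows.
The type and function names (toc_mode, toc_bandwidth, toc_info, parse_toc) are
hypothetical and chosen for this example; frame sizes are expressed in
microseconds so that 2.5 ms can be represented as an integer.
</t>
<figure>
<artwork><![CDATA[
```c
/* Hypothetical types for the fields signaled by the TOC byte. */
typedef enum { MODE_SILK_ONLY, MODE_HYBRID, MODE_CELT_ONLY } toc_mode;
typedef enum { BW_NB, BW_MB, BW_WB, BW_SWB, BW_FB } toc_bandwidth;

typedef struct {
    int config;              /* configuration number, 0..31 */
    toc_mode mode;
    toc_bandwidth bandwidth;
    int frame_size_us;       /* frame size in microseconds */
    int stereo;              /* "s": 0 = mono, 1 = stereo */
    int code;                /* "c": frame count code, 0..3 */
} toc_info;

toc_info parse_toc(unsigned char toc)
{
    static const int silk_us[4] = {10000, 20000, 40000, 60000};
    static const int hybrid_us[2] = {10000, 20000};
    static const int celt_us[4] = {2500, 5000, 10000, 20000};
    /* CELT-only has no MB entry: NB, WB, SWB, FB. */
    static const toc_bandwidth celt_bw[4] =
        {BW_NB, BW_WB, BW_SWB, BW_FB};
    toc_info info;

    info.config = toc >> 3;          /* top five bits */
    info.stereo = (toc >> 2) & 1;    /* next bit */
    info.code = toc & 3;             /* low two bits */

    if (info.config < 12) {
        info.mode = MODE_SILK_ONLY;  /* NB, MB, WB in groups of 4 */
        info.bandwidth = (toc_bandwidth)(BW_NB + info.config / 4);
        info.frame_size_us = silk_us[info.config % 4];
    } else if (info.config < 16) {
        info.mode = MODE_HYBRID;     /* SWB, FB in groups of 2 */
        info.bandwidth = (info.config < 14) ? BW_SWB : BW_FB;
        info.frame_size_us = hybrid_us[info.config % 2];
    } else {
        info.mode = MODE_CELT_ONLY;
        info.bandwidth = celt_bw[(info.config - 16) / 4];
        info.frame_size_us = celt_us[info.config % 4];
    }
    return info;
}
```
]]></artwork>
</figure>
<t>
For instance, a TOC byte of 0x65 has config 12, s=1, and c=1, which this sketch
reports as a stereo Hybrid SWB packet holding two 10 ms frames of equal size.
</t>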
</section>
<section title="Frame Packing">
<t>
This section describes how frames are packed according to each possible value
of "c" in the TOC byte.
</t>
<section anchor="frame-length-coding" title="Frame Length Coding">
<t>
When a packet contains multiple VBR frames (i.e., code 2 or 3), the compressed
length of one or more of these frames is indicated with a one- or two-byte
sequence, with the meaning of the first byte as follows:
<list style="symbols">
<t>0: No frame (discontinuous transmission (DTX) or lost packet)</t>
<t>1...251: Length of the frame in bytes</t>
<t>252...255: A second byte is needed. The total length is (second_byte*4)+first_byte</t>
</list>
</t>
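<t>
The length coding above can be sketched in C as a pair of functions.
The names decode_frame_length and encode_frame_length, and their return
conventions, are hypothetical and chosen for this example.
</t>
<figure>
<artwork><![CDATA[
```c
#include <stddef.h>

/* Decode a one- or two-byte frame length from data[0..avail-1].
 * Returns the number of bytes consumed (1 or 2), or -1 if the
 * buffer is too short to hold the length. */
int decode_frame_length(const unsigned char *data, size_t avail,
                        int *length)
{
    if (avail < 1) return -1;
    if (data[0] < 252) {
        *length = data[0];
        return 1;
    }
    if (avail < 2) return -1;
    *length = 4 * data[1] + data[0];
    return 2;
}

/* Encode a length in 0..1275. Returns bytes written, or -1. */
int encode_frame_length(int length, unsigned char out[2])
{
    if (length < 0 || length > 1275) return -1;
    if (length < 252) {
        out[0] = (unsigned char)length;
        return 1;
    }
    /* First byte is 252..255 and carries the low two bits. */
    out[0] = (unsigned char)(252 + (length & 3));
    out[1] = (unsigned char)((length - out[0]) >> 2);
    return 2;
}
```
]]></artwork>
</figure>
<t>
In this sketch a length of 252 is written as the bytes 252, 0, and a length of
1275 as the bytes 255, 255.
</t>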
<t>
The special length 0 indicates that no frame is available, either because it
was dropped during transmission by some intermediary or because the encoder
chose not to transmit it.
Any Opus frame in any mode MAY have a length of 0.
</t>
<t>
The maximum representable length is 255*4+255=1275 bytes.
For 20 ms frames, this represents a bitrate of 510 kb/s, which is
approximately the highest useful rate for lossily compressed fullband stereo
music.
Beyond this point, lossless codecs are more appropriate.
It is also roughly the maximum useful rate of the MDCT layer, as shortly
thereafter quality no longer improves with additional bits due to limitations
on the codebook sizes.
</t>
<t anchor="R2">
No length is transmitted for the last frame in a VBR packet, or for any of the
frames in a CBR packet, as it can be inferred from the total size of the
packet and the size of all other data in the packet.
However, the length of any individual frame MUST NOT exceed
1275 bytes [R2], to allow for repacketization by gateways,
conference bridges, or other software.
</t>
</section>
<section title="Code 0: One Frame in the Packet">
<t>
For code 0 packets, the TOC byte is immediately followed by N-1 bytes
of compressed data for a single frame (where N is the size of the packet),
as illustrated in <xref target="code0_packet"/>.
</t>
<figure anchor="code0_packet" title="A Code 0 Packet" align="center">
<artwork align="center"><![CDATA[
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| config |s|0|0| |
+-+-+-+-+-+-+-+-+ |
| Compressed frame 1 (N-1 bytes)... :
: |
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]></artwork>
</figure>
</section>
<section title="Code 1: Two Frames in the Packet, Each with Equal Compressed Size">
<t anchor="R3">
For code 1 packets, the TOC byte is immediately followed by the
(N-1)/2 bytes of compressed data for the first frame, followed by
(N-1)/2 bytes of compressed data for the second frame, as illustrated in
<xref target="code1_packet"/>.
The number of payload bytes available for compressed data, N-1, MUST be even
for all code 1 packets [R3].
</t>
<figure anchor="code1_packet" title="A Code 1 Packet" align="center">
<artwork align="center"><![CDATA[
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| config |s|0|1| |
+-+-+-+-+-+-+-+-+ :
| Compressed frame 1 ((N-1)/2 bytes)... |
: +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ :
| Compressed frame 2 ((N-1)/2 bytes)... |
: +-+-+-+-+-+-+-+-+
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]></artwork>
</figure>
</section>
<section title="Code 2: Two Frames in the Packet, with Different Compressed Sizes">
<t anchor="R4">
For code 2 packets, the TOC byte is followed by a one- or two-byte sequence
indicating the length of the first frame (marked N1 in <xref target='code2_packet'/>),
followed by N1 bytes of compressed data for the first frame.
The remaining N-N1-2 or N-N1-3 bytes are the compressed data for the
second frame.
This is illustrated in <xref target="code2_packet"/>.
A code 2 packet MUST contain enough bytes to represent a valid length.
For example, a 1-byte code 2 packet is always invalid, and a 2-byte code 2
packet whose second byte is in the range 252...255 is also invalid.
The length of the first frame, N1, MUST also be no larger than the size of the
payload remaining after decoding that length for all code 2 packets [R4].
This makes, for example, a 2-byte code 2 packet with a second byte in the range
1...251 invalid as well (the only valid 2-byte code 2 packet is one where the
length of both frames is zero).
</t>
<figure anchor="code2_packet" title="A Code 2 Packet" align="center">
<artwork align="center"><![CDATA[
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| config |s|1|0| N1 (1-2 bytes): |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ :
| Compressed frame 1 (N1 bytes)... |
: +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
| Compressed frame 2... :
: |
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]></artwork>
</figure>
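<t>
The validity rules for code 2 packets can be sketched in C as shown below.
The structure code2_frames and the function parse_code2 are hypothetical names
for this example; the function also applies the 1275-byte limit [R2] to the
second frame, whose length is implied by the packet size.
</t>
<figure>
<artwork><![CDATA[
```c
#include <stddef.h>

/* Hypothetical result type: where each frame starts and its size. */
typedef struct {
    const unsigned char *frame1;
    size_t len1;
    const unsigned char *frame2;
    size_t len2;
} code2_frames;

/* Split a code 2 packet of n bytes. Returns 0 on success, or -1
 * if the packet is malformed. */
int parse_code2(const unsigned char *pkt, size_t n, code2_frames *out)
{
    size_t pos = 1;                   /* skip the TOC byte */
    size_t n1;

    if (n < 2 || (pkt[0] & 3) != 2) return -1;
    n1 = pkt[pos++];
    if (n1 >= 252) {
        if (pos >= n) return -1;      /* second length byte missing */
        n1 += 4 * (size_t)pkt[pos++];
    }
    if (n1 > n - pos) return -1;      /* [R4]: N1 exceeds payload */
    if (n - pos - n1 > 1275) return -1;   /* [R2] for frame 2 */

    out->frame1 = pkt + pos;
    out->len1 = n1;
    out->frame2 = pkt + pos + n1;
    out->len2 = n - pos - n1;
    return 0;
}
```
]]></artwork>
</figure>
<t>
The only 2-byte packet this sketch accepts is a TOC byte followed by a zero
length byte, in which case both frames have length zero.
</t>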
</section>
<section title="Code 3: A Signaled Number of Frames in the Packet">
<t anchor="R5">
Code 3 packets signal the number of frames, as well as additional
padding, called "Opus padding" to indicate that this padding is added at the
Opus layer, rather than at the transport layer.
Code 3 packets MUST have at least 2 bytes [R6,R7].
The TOC byte is followed by a byte encoding the number of frames in the packet
in bits 2 to 7 (marked "M" in <xref target='frame_count_byte'/>), with bit 1 indicating whether
or not Opus padding is inserted (marked "p" in <xref target='frame_count_byte'/>), and bit 0
indicating VBR (marked "v" in <xref target='frame_count_byte'/>).
M MUST NOT be zero, and the audio duration contained within a packet MUST NOT
exceed 120 ms [R5].
This limits the maximum frame count for any frame size to 48 (for 2.5 ms
frames), with lower limits for longer frame sizes.
<xref target="frame_count_byte"/> illustrates the layout of the frame count
byte.
</t>
<figure anchor="frame_count_byte" title="The frame count byte">
<artwork align="center"><![CDATA[
0
0 1 2 3 4 5 6 7
+-+-+-+-+-+-+-+-+
|v|p| M |
+-+-+-+-+-+-+-+-+
]]></artwork>
</figure>
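<t>
A C sketch of decoding the frame count byte and applying the constraints on M
follows.
The names frame_count_info and parse_frame_count are hypothetical; the frame
size is assumed to have been obtained from the TOC byte and is passed in
microseconds.
</t>
<figure>
<artwork><![CDATA[
```c
/* Hypothetical result type for the frame count byte. */
typedef struct {
    int vbr;       /* "v": 1 if frame lengths are signaled (VBR) */
    int padding;   /* "p": 1 if Opus padding is present */
    int count;     /* "M": number of frames */
} frame_count_info;

/* Returns 0 on success, or -1 if M is zero or the packet would
 * hold more than 120 ms of audio [R5]. */
int parse_frame_count(unsigned char b, int frame_size_us,
                      frame_count_info *out)
{
    out->vbr = (b >> 7) & 1;       /* bit 0 (most significant) */
    out->padding = (b >> 6) & 1;   /* bit 1 */
    out->count = b & 0x3F;         /* bits 2 to 7 */

    if (out->count == 0) return -1;
    if ((long)out->count * frame_size_us > 120000L) return -1;
    return 0;
}
```
]]></artwork>
</figure>
<t>
With 2.5 ms frames, M=48 is accepted and M=49 is rejected; with 20 ms frames,
the limit is M=6.
</t>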
<t>
When Opus padding is used, the number of bytes of padding is encoded in the
bytes following the frame count byte.
Values from 0...254 indicate that 0...254 bytes of padding are included,
in addition to the byte(s) used to indicate the size of the padding.
If the value is 255, then the size of the additional padding is 254 bytes,
plus the padding value encoded in the next byte.
There MUST be at least one more byte in the packet in this case [R6,R7].
The additional padding bytes appear at the end of the packet, and MUST be set
to zero by the encoder to avoid creating a covert channel.
The decoder MUST accept any value for the padding bytes, however.
</t>
<t>
Although this encoding provides multiple ways to indicate a given number of
padding bytes, each uses a different number of bytes to indicate the padding
size, and thus will increase the total packet size by a different amount.
For example, to add 255 bytes to a packet, set the padding bit, p, to 1, insert
a single byte after the frame count byte with a value of 254, and append 254
padding bytes with the value zero to the end of the packet.
To add 256 bytes to a packet, set the padding bit to 1, insert two bytes after
the frame count byte with the values 255 and 0, respectively, and append 254
padding bytes with the value zero to the end of the packet.
By using the value 255 multiple times, it is possible to create a packet of any
specific, desired size.
Let P be the number of header bytes used to indicate the padding size plus the
number of padding bytes themselves (i.e., P is the total number of bytes added
to the packet).
Then P MUST be no more than N-2 [R6,R7].
</t>
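<t>
For an encoder that wants to grow a packet by exactly P bytes, the shortest
set of length bytes follows from the arithmetic above: each byte with the value
255 accounts for 255 bytes of P (itself plus 254 padding bytes), and the final
byte b accounts for 1+b bytes.
The Python sketch below is illustrative only; encode_padding is a hypothetical
helper name, and it returns the length bytes and the zero padding separately
because they are written at different positions in the packet.
</t>
<figure>
<artwork><![CDATA[
```python
def encode_padding(p_total):
    """Return (length_bytes, zero_padding) adding exactly p_total bytes."""
    if p_total < 1:
        raise ValueError("padding needs at least one length byte")
    # p_total = 255 * k + 1 + last, with 0 <= last <= 254
    k, last = divmod(p_total - 1, 255)
    length_bytes = bytes([255] * k + [last])
    zero_padding = bytes(p_total - len(length_bytes))  # all zero
    return length_bytes, zero_padding
```
]]></artwork>
</figure>
<t>
With this sketch, encode_padding(255) yields the single length byte 254 and
encode_padding(256) yields the length bytes 255 and 0, each followed by 254
zero bytes, matching the two examples in the text.
</t>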
<t anchor="R6">
In the CBR case, let R=N-2-P be the number of bytes remaining in the packet
after subtracting the (optional) padding.
Then the compressed length of each frame in bytes is equal to R/M.
The value R MUST be a non-negative integer multiple of M [R6].
The compressed data for all M frames follows, each of size
R/M bytes, as illustrated in <xref target="code3cbr_packet"/>.
</t>
<figure anchor="code3cbr_packet" title="A CBR Code 3 Packet" align="center">
<artwork align="center"><![CDATA[
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| config  |s|1|1|0|p|     M     |   Padding length (Optional)   :
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
:               Compressed frame 1 (R/M bytes)...               :
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
:               Compressed frame 2 (R/M bytes)...               :
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
:                              ...                              :
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
:               Compressed frame M (R/M bytes)...               :
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
:                  Opus Padding (Optional)...                   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]></artwork>
</figure>
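<t>
The CBR computation can be written out directly.
The following Python sketch is non-normative; the function name and argument
order are assumptions made for the example, with n, m, and p standing for N, M,
and P as defined above.
It also applies the 1275-byte limit on implicit frame lengths [R2].
</t>
<figure>
<artwork><![CDATA[
```python
def cbr_frame_size(n, m, p):
    """Return the size in bytes of each frame in a CBR Code 3 packet.

    n: total packet length, m: frame count (1..48),
    p: length bytes plus trailing padding bytes (0 if no padding).
    """
    if n < 2:
        raise ValueError("Code 3 packets need at least 2 bytes")
    r = n - 2 - p               # bytes left for compressed frames
    if r < 0 or r % m != 0:
        raise ValueError("R is not a non-negative multiple of M")
    size = r // m
    if size > 1275:
        raise ValueError("implicit frame length exceeds 1275 bytes")
    return size
```
]]></artwork>
</figure>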
<t anchor="R7">
In the VBR case, the (optional) padding length is followed by M-1 frame
lengths (indicated by "N1" to "N[M-1]" in <xref target='code3vbr_packet'/>), each encoded in a
one- or two-byte sequence as described above.
The packet MUST contain enough data for the M-1 lengths after removing the
(optional) padding, and the sum of these lengths MUST be no larger than the
number of bytes remaining in the packet after decoding them [R7].
The compressed data for all M frames follows, each frame consisting of the
indicated number of bytes, with the final frame consuming any remaining bytes
before the final padding, as illustrated in <xref target="code3vbr_packet"/>.
The number of header bytes (TOC byte, frame count byte, padding length bytes,
and frame length bytes), plus the signaled length of the first M-1 frames themselves,
plus the signaled length of the padding MUST be no larger than N, the total size of the
packet.
</t>
<figure anchor="code3vbr_packet" title="A VBR Code 3 Packet" align="center">
<artwork align="center"><![CDATA[
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| config  |s|1|1|1|p|     M     |   Padding length (Optional)   :
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
: N1 (1-2 bytes): N2 (1-2 bytes):     ...       :     N[M-1]    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
:               Compressed frame 1 (N1 bytes)...                :
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
:               Compressed frame 2 (N2 bytes)...                :
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
:                              ...                              :
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
:                     Compressed frame M...                     :
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
:                  Opus Padding (Optional)...                   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]></artwork>
</figure>
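<t>
A VBR Code 3 parser might recover the M frame sizes as sketched below.
This Python fragment is illustrative rather than normative.
It assumes the one- or two-byte frame length coding defined earlier in this
document (a first byte of 0...251 is the length itself; a first byte of
252...255 is followed by a second byte, and the length is the second byte
times 4 plus the first byte).
The function names, and the convention that the caller passes the offset of the
first frame length byte and the already-decoded number of trailing padding
bytes, are assumptions made for the example.
</t>
<figure>
<artwork><![CDATA[
```python
def read_frame_length(data, pos, end):
    """Decode one 1- or 2-byte frame length; return (length, next_pos)."""
    if pos >= end:
        raise ValueError("not enough data for a frame length")
    first = data[pos]
    if first < 252:
        return first, pos + 1
    if pos + 1 >= end:
        raise ValueError("not enough data for a 2-byte frame length")
    return data[pos + 1] * 4 + first, pos + 2


def vbr_frame_sizes(data, pos, m, padding):
    """Return ([size_1, ..., size_M], offset_of_frame_1).

    pos: offset of the first frame length byte (after any padding
    length bytes); padding: number of trailing padding bytes.
    """
    end = len(data) - padding   # frames must end before the padding
    sizes = []
    for _ in range(m - 1):
        size, pos = read_frame_length(data, pos, end)
        sizes.append(size)
    last = end - pos - sum(sizes)   # final frame takes what is left
    if last < 0:
        raise ValueError("signaled lengths exceed the packet size")
    if last > 1275:
        raise ValueError("implicit frame length exceeds 1275 bytes")
    sizes.append(last)
    return sizes, pos
```
]]></artwork>
</figure>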
</section>
</section>
<section anchor="examples" title="Examples">
<t>
Simplest case, one NB mono 20 ms SILK frame:
</t>
<figure anchor='framing_example_1'>
<artwork><![CDATA[
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|    1    |0|0|0|              compressed data...               :
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]></artwork>
</figure>
<t>
Two FB mono 5 ms CELT frames of the same compressed size:
</t>
<figure anchor='framing_example_2'>
<artwork><![CDATA[
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   29    |0|0|1|              compressed data...               :
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]></artwork>
</figure>
<t>
Two FB mono 20 ms Hybrid frames of different compressed size:
</t>
<figure anchor='framing_example_3'>
<artwork><![CDATA[
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   15    |0|1|1|1|0|     2     |      N1       |               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
|                      compressed data...                       :
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]></artwork>
</figure>
<t>
Four FB stereo 20 ms CELT frames of the same compressed size:
</t>
<figure anchor='framing_example_4'>
<artwork><![CDATA[
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   31    |1|1|1|0|0|     4     |      compressed data...       :
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]></artwork>
</figure>
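<t>
The header bytes shown in these examples can be reproduced with two small
helpers.
The Python sketch below is illustrative only; it assumes, as in the figures,
that bit 0 is the most significant bit of each byte, and the helper names are
hypothetical.
</t>
<figure>
<artwork><![CDATA[
```python
def toc_byte(config, stereo, code):
    """Pack config (5 bits), s (1 bit), and c (2 bits) into a TOC byte."""
    if not (0 <= config <= 31 and stereo in (0, 1) and 0 <= code <= 3):
        raise ValueError("field out of range")
    return (config << 3) | (stereo << 2) | code


def frame_count_byte(vbr, padding, m):
    """Pack v (1 bit), p (1 bit), and M (6 bits) for a Code 3 packet."""
    if not (vbr in (0, 1) and padding in (0, 1) and 1 <= m <= 48):
        raise ValueError("field out of range")
    return (vbr << 7) | (padding << 6) | m


# Header bytes for the four examples above, in order.
EXAMPLE_HEADERS = [
    bytes([toc_byte(1, 0, 0)]),
    bytes([toc_byte(29, 0, 1)]),
    bytes([toc_byte(15, 0, 3), frame_count_byte(1, 0, 2)]),
    bytes([toc_byte(31, 1, 3), frame_count_byte(0, 0, 4)]),
]
```
]]></artwork>
</figure>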
</section>
<section anchor="malformed-packets" title="Receiving Malformed Packets">
<t>
A receiver MUST NOT process packets that violate any of the rules above as
normal Opus packets.
They are reserved for future applications, such as in-band headers (containing
metadata, etc.).
Implementations of <spanx style="emph">this</spanx> specification may therefore
treat packets that violate these constraints as malformed and discard them.
</t>
<t>
These constraints are summarized here for reference:
<list style="format [R%d]">
<t>Packets are at least one byte.</t>
<t>No implicit frame length is larger than 1275 bytes.</t>
<t>Code 1 packets have an odd total length, N, so that (N-1)/2 is an
integer.</t>
<t>Code 2 packets have enough bytes after the TOC for a valid frame
length, and that length is no larger than the number of bytes remaining in the
packet.</t>
<t>Code 3 packets contain at least one frame, but no more than 120 ms
of audio total.</t>
<t>The length of a CBR code 3 packet, N, is at least two bytes, the number of
bytes added to indicate the padding size plus the trailing padding bytes
themselves, P, is no more than N-2, and the frame count, M, satisfies
the constraint that (N-2-P) is a non-negative integer multiple of M.</t>
<t>VBR code 3 packets are large enough to contain all the header bytes (TOC
byte, frame count byte, any padding length bytes, and any frame length bytes),
plus the length of the first M-1 frames, plus any trailing padding bytes.</t>
</list>
</t>
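<t>
As a non-normative illustration of constraints [R1] through [R4], the following
Python sketch derives the frame sizes of Code 0, Code 1, and Code 2 packets and
rejects packets that violate those rules.
The function name is hypothetical, Code 3 packets are deliberately left out of
this sketch, and the two-byte frame length coding is the one defined earlier in
this document.
</t>
<figure>
<artwork><![CDATA[
```python
MAX_FRAME_BYTES = 1275


def frame_sizes_codes_0_to_2(packet):
    """Return the list of frame sizes, or raise ValueError if malformed."""
    n = len(packet)
    if n < 1:
        raise ValueError("R1: packet is empty")
    code = packet[0] & 0x03
    if code == 0:                      # one frame
        sizes = [n - 1]
    elif code == 1:                    # two frames of equal size
        if (n - 1) % 2 != 0:
            raise ValueError("R3: (N-1)/2 is not an integer")
        sizes = [(n - 1) // 2] * 2
    elif code == 2:                    # two frames, first length signaled
        if n < 2:
            raise ValueError("R4: no room for a frame length")
        first, used = packet[1], 2
        if first >= 252:
            if n < 3:
                raise ValueError("R4: truncated two-byte frame length")
            first, used = packet[2] * 4 + first, 3
        if first > n - used:
            raise ValueError("R4: frame length exceeds remaining bytes")
        sizes = [first, n - used - first]
    else:
        raise NotImplementedError("Code 3 is not covered by this sketch")
    if max(sizes) > MAX_FRAME_BYTES:
        raise ValueError("R2: implicit frame length exceeds 1275 bytes")
    return sizes
```
]]></artwork>
</figure>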
</section>
</section>
<section title="Opus Decoder">
<t>
The Opus decoder consists of two main blocks: the SILK decoder and the CELT
decoder.
At any given time, one or both of the SILK and CELT decoders may be active.
The output of the Opus decoder is the sum of the outputs from the SILK and CELT
decoders with proper sample rate conversion and delay compensation on the SILK
side, and optional decimation (when decoding to sample rates less than
48 kHz) on the CELT side, as illustrated in the block diagram below.
</t>
<figure>
<artwork>
<![CDATA[
                         +---------+    +------------+
                         |  SILK   |    |   Sample   |
                      +->| Decoder |--->|    Rate    |----+
Bit-    +---------+   |  |         |    | Conversion |    v
stream  |  Range  |---+  +---------+    +------------+  /---\  Audio
------->| Decoder |                                     | + |------>
        |         |---+  +---------+    +------------+  \---/
        +---------+   |  |  CELT   |    | Decimation |    ^
                      +->| Decoder |--->| (Optional) |----+
                         |         |    |            |
                         +---------+    +------------+
]]>
</artwork>
</figure>
<section anchor="range-decoder" title="Range Decoder">
<t>
Opus uses an entropy coder based on range coding <xref target="range-coding"></xref>
<xref target="Martin79"></xref>,
which is itself a rediscovery of the FIFO arithmetic code introduced by <xref target="coding-thesis"></xref>.
It is very similar to arithmetic encoding, except that encoding is done with
digits in any base instead of with bits,
so it is faster when using larger bases (i.e., a byte). All of the
calculations in the range coder must use bit-exact integer arithmetic.
</t>
<t>
Symbols may also be coded as "raw bits" packed directly into the bitstream,
bypassing the range coder.
These are packed backwards starting at the end of the frame, as illustrated in
<xref target="rawbits-example"/>.
This reduces complexity and makes the stream more resilient to bit errors, as
corruption in the raw bits will not desynchronize the decoding process, unlike
corruption in the input to the range decoder.
Raw bits are only used in the CELT layer.
</t>
<figure anchor="rawbits-example" title="Illustrative example of packing range
coder and raw bits data">
<artwork align="center"><![CDATA[
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Range coder data (packed MSB to LSB) ->                       :
+                                                               +
:                                                               :
+     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
:     | <- Boundary occurs at an arbitrary bit position         :
+-+-+-+                                                         +
:                          <- Raw bits data (packed LSB to MSB) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]></artwork>
</figure>
<t>
Each symbol coded by the range coder is drawn from a finite alphabet and coded
in a separate "context", which describes the size of the alphabet and the
relative frequency of each symbol in that alphabet.
</t>
<t>
Suppose there is a context with n symbols, identified with an index that ranges
from 0 to n-1.
The parameters needed to encode or decode symbol k in this context are
represented by a three-tuple (fl[k], fh[k], ft), with
0 &lt;= fl[k] &lt; fh[k] &lt;= ft &lt;= 65535.
The values of this tuple are derived from the probability model for the
symbol, represented by traditional "frequency counts".
Because Opus uses static contexts, these are not updated as symbols are decoded.
Let f[i] be the frequency of symbol i.
Then the three-tuple corresponding to symbol k is given by
</t>
<figure align="center">
<artwork align="center"><![CDATA[
        k-1                                   n-1
        __                                    __
fl[k] = \  f[i],  fh[k] = fl[k] + f[k],  ft = \  f[i]
        /_                                    /_
        i=0                                   i=0
]]></artwork>
</figure>
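<t>
The sums above amount to a running total over the frequency counts.
The following Python sketch is illustrative only; the function name is an
assumption, and the example frequencies in the comment are made up rather than
taken from any Opus context.
</t>
<figure>
<artwork><![CDATA[
```python
def context_tuples(freqs):
    """Return [(fl[k], fh[k], ft)] for frequency counts f[0..n-1]."""
    ft = sum(freqs)
    if ft > 65535:
        raise ValueError("ft must not exceed 65535")
    tuples = []
    fl = 0
    for f in freqs:
        if f <= 0:
            # fl[k] < fh[k] requires a strictly positive frequency
            raise ValueError("frequencies must be positive")
        tuples.append((fl, fl + f, ft))
        fl += f
    return tuples


# Made-up frequencies 1, 2, 5 give (0, 1, 8), (1, 3, 8), (3, 8, 8).
```
]]></artwork>
</figure>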
<t>
The range decoder extracts the symbols and integers encoded using the range
encoder in <xref target="range-encoder"/>.
The range decoder maintains an internal state vector composed of the two-tuple
(val, rng), representing the difference between the high end of the
current range and the actual coded value, minus one, and the size of the
current range, respectively.
Both val and rng are 32-bit unsigned integer values.
</t>
<section anchor="range-decoder-init" title="Range Decoder Initialization">
<t>
Let b0 be the first input byte (or zero if there are no bytes in this Opus
frame).
The decoder initializes rng to 128 and initializes val to
(127 - (b0>>1)), where (b0>>1) is the top 7 bits of the
first input byte.
It saves the remaining bit, (b0&amp;1), for use in the renormalization
procedure described in <xref target="range-decoder-renorm"/>, which the
decoder invokes immediately after initialization to read additional bits and
establish the invariant that rng > 2**23.
</t>
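<t>
The initial state can be sketched as follows.
This Python fragment is a non-normative illustration that covers only the
assignments described in this section; the renormalization step that follows
is not shown, and the function name and the returned three-tuple are
assumptions made for the example.
</t>
<figure>
<artwork><![CDATA[
```python
def range_decoder_init(frame):
    """Return (val, rng, saved_bit) before the first renormalization."""
    b0 = frame[0] if len(frame) > 0 else 0   # zero if the frame is empty
    rng = 128
    val = 127 - (b0 >> 1)    # uses the top 7 bits of the first byte
    saved_bit = b0 & 1       # kept for the renormalization procedure
    return val, rng, saved_bit
```
]]></artwork>
</figure>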
</section>
<section anchor="decoding-symbols" title="Decoding Symbols">
<t>
Decoding a symbol is a two-step process.
The first step determines a 16-bit unsigned value fs, which lies within the
range of some symbol in the current context.
The second step updates the range decoder state with th