gh-135676: Lexical analysis: Reword String literals and related sections #135942

WIP
encukou committed Jun 25, 2025
commit e44fa66cf2da63763a3ed37f7d59da28e95c785c
5 changes: 1 addition & 4 deletions Doc/reference/grammar.rst
Original file line number Diff line number Diff line change
Expand Up @@ -10,11 +10,8 @@ error recovery.

The notation used here is the same as in the preceding docs,
and is described in the :ref:`notation <notation>` section,
except for a few extra complications:
except for an extra complication:

* ``&e``: a positive lookahead (that is, ``e`` is required to match but
not consumed)
* ``!e``: a negative lookahead (that is, ``e`` is required *not* to match)
* ``~`` ("cut"): commit to the current alternative and fail the rule
even if this fails to parse

16 changes: 12 additions & 4 deletions Doc/reference/introduction.rst
Expand Up @@ -145,15 +145,23 @@ The definition to the right of the colon uses the following syntax elements:
* ``e?``: A question mark has exactly the same meaning as square brackets:
the preceding item is optional.
* ``(e)``: Parentheses are used for grouping.

The following notation is only used in
:ref:`lexical definitions <notation-lexical-vs-syntactic>`.

* ``"a"..."z"``: Two literal characters separated by three dots mean a choice
of any single character in the given (inclusive) range of ASCII characters.
This notation is only used in
:ref:`lexical definitions <notation-lexical-vs-syntactic>`.
* ``<...>``: A phrase between angular brackets gives an informal description
of the matched symbol (for example, ``<any ASCII character except "\">``),
or an abbreviation that is defined in nearby text (for example, ``<Lu>``).
This notation is only used in
:ref:`lexical definitions <notation-lexical-vs-syntactic>`.

.. _lexical-lookaheads:

Some definitions also use *lookaheads*, which indicate that an element
must (or must not) match at a given position, but without consuming any input:

* ``&e``: a positive lookahead (that is, ``e`` is required to match)
* ``!e``: a negative lookahead (that is, ``e`` is required *not* to match)
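
As a rough analogy only (regular expressions are not the grammar notation used here), Python's ``re`` module offers the same two ideas as ``(?=...)`` and ``(?!...)``. The patterns and sample inputs below are made up for illustration:

```python
import re

# Positive lookahead, comparable to ``&e``: "px" must follow the digits,
# but it is not part of the match.
digits_before_px = re.compile(r"\d+(?=px)")

# Negative lookahead, comparable to ``!e``: keep taking characters
# for as long as the next character is not a single quote.
until_quote = re.compile(r"(?:(?!').)*")

px_match = digits_before_px.match("12px")
no_px_match = digits_before_px.match("12em")
quote_match = until_quote.match("abc'def")

print(px_match.group(), px_match.end())
print(no_px_match)
print(quote_match.group())
```

In both cases the lookahead only checks the upcoming input; the match ends before the text that was inspected.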

The unary operators (``*``, ``+``, ``?``) bind as tightly as possible;
the vertical bar (``|``) binds most loosely.
110 changes: 63 additions & 47 deletions Doc/reference/lexical_analysis.rst
Expand Up @@ -39,25 +39,37 @@ The end of a logical line is represented by the token :data:`~token.NEWLINE`.
Statements cannot cross logical line boundaries except where :data:`!NEWLINE`
is allowed by the syntax (e.g., between statements in compound statements).
A logical line is constructed from one or more *physical lines* by following
the explicit or implicit *line joining* rules.
the :ref:`explicit <explicit-joining>` or :ref:`implicit <implicit-joining>`
*line joining* rules.


.. _physical-lines:

Physical lines
--------------

A physical line is a sequence of characters terminated by an end-of-line
sequence. In source files and strings, any of the standard platform line
termination sequences can be used - the Unix form using ASCII LF (linefeed),
the Windows form using the ASCII sequence CR LF (return followed by linefeed),
or the old Macintosh form using the ASCII CR (return) character. All of these
forms can be used equally, regardless of platform. The end of input also serves
as an implicit terminator for the final physical line.
A physical line is a sequence of characters terminated by one of the following
end-of-line sequences:

When embedding Python, source code strings should be passed to Python APIs using
the standard C conventions for newline characters (the ``\n`` character,
representing ASCII LF, is the line terminator).
* the Unix form using ASCII LF (linefeed),
* the Windows form using the ASCII sequence CR LF (return followed by linefeed),
* the old Macintosh form using the ASCII CR (return) character.

Regardless of platform, each of these sequences is replaced by a single
ASCII LF (linefeed) character.
(This is done even inside :ref:`string literals <strings>`.)
Each line can use any of the sequences; they do not need to be consistent
within a file.

The end of input also serves as an implicit terminator for the final
physical line.

Formally:

.. grammar-snippet::
:group: python-grammar

newline: <ASCII LF> | <ASCII CR> <ASCII LF> | <ASCII CR>
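
The normalization can be observed with a small sketch. It assumes that source passed to ``exec`` as a string is normalized in the same way as source read from a file; the variable names and the sample source are illustrative only:

```python
# Source text that mixes all three end-of-line sequences:
# CR LF after "x = 1", a lone CR after "y = 2", and CR LF inside
# a triple-quoted string literal.
source = "x = 1\r\ny = 2\rz = '''a\r\nb'''\n"

namespace = {}
exec(source, namespace)

print(namespace["x"], namespace["y"])
print(repr(namespace["z"]))
```

The three statements are executed as three separate lines, and the CR LF inside the string literal ends up as a single linefeed in the resulting string.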


.. _comments:
Expand Down Expand Up @@ -484,14 +496,21 @@ Literals

Literals are notations for constant values of some built-in types.

In terms of lexical analysis, Python has :ref:`string, bytes <strings>`
and :ref:`numeric <numbers>` literals.

Other “literals” are lexically denoted using :ref:`keywords <keywords>`
(``None``, ``True``, ``False``) and the special
:ref:`ellipsis token <lexical-ellipsis>` (``...``).
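
One way to see this split is to run a line through the standard ``tokenize`` module. This is a minimal sketch; the helper ``token_types`` and the sample line are hypothetical, and ``tokenize`` is used here only as an approximation of the lexical analysis described in this document:

```python
import io
import token
import tokenize


def token_types(source):
    """Map the text of each name, number, string and operator token to its type name."""
    wanted = (token.NAME, token.NUMBER, token.STRING, token.OP)
    result = {}
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in wanted:
            result[tok.string] = token.tok_name[tok.type]
    return result


kinds = token_types("None, True, 42, 'text', b'raw', ...\n")
for text, kind in kinds.items():
    print(f"{text!r:10} {kind}")
```

String and bytes literals are reported as ``STRING`` and the number as ``NUMBER``, while ``None`` and ``True`` arrive as ``NAME`` tokens and ``...`` as an operator token.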


.. index:: string literal, bytes literal, ASCII
single: ' (single quote); string literal
single: " (double quote); string literal
.. _strings:

String and Bytes literals
-------------------------
=========================

String literals are text enclosed in single quotes (``'``) or double
quotes (``"``). For example:
Expand Down Expand Up @@ -635,41 +654,26 @@ They may not be combined with ``'b'``, ``'u'``, or each other.


String literals, except "F-strings" and "T-strings", are described by the
following lexical definitions:
following lexical definitions.

These definitions use :ref:`negative lookaheads <lexical-lookaheads>` (``!``)
to indicate that an ending quote ends the literal.

.. grammar-snippet::
:group: python-grammar

STRING: stringliteral | bytesliteral | fstring | tstring

stringliteral: [`stringprefix`](`stringcontent`)
stringprefix: <("r" | "u"), case-insensitive>
stringcontent: `quote` `stringitem`* <matching `quote`>
quote: "'" | '"' | "'''" | '"""'
STRING: [`stringprefix`] (`stringcontent`)
stringprefix: <("r" | "u" | "b" | "br" | "rb"), case-insensitive>
stringcontent:
| "'" ( !"'" `stringitem`)* "'"
| '"' ( !'"' `stringitem`)* '"'
| "'''" ( !"'''" `longstringitem`)* "'''"
| '"""' ( !'"""' `longstringitem`)* '"""'
stringitem: `stringchar` | `stringescapeseq`
stringchar: <any `source_character`, except as listed below>
stringchar: <any `source_character`, except backslash and newline>
longstringitem: `stringitem` | newline
stringescapeseq: "\" <any `source_character`>

``stringchar`` can not include:

- the backslash, ``\``;
- in triple-quoted strings (quoted by ``'''`` or ``"""``), the newline;
- the quote character.


.. grammar-snippet::
:group: python-grammar

bytesliteral: `bytesprefix`(`shortbytes` | `longbytes`)
bytesprefix: <("b" | "br" | "rb" ), case-insensitive>
shortbytes: "'" `shortbytesitem`* "'" | '"' `shortbytesitem`* '"'
longbytes: "'''" `longbytesitem`* "'''" | '"""' `longbytesitem`* '"""'
shortbytesitem: `shortbyteschar` | `bytesescapeseq`
longbytesitem: `longbyteschar` | `bytesescapeseq`
shortbyteschar: <any ASCII `source_character` except "\" or newline or the quote>
longbyteschar: <any ASCII `source_character` except "\">
bytesescapeseq: "\" <any ASCII `source_character`>

Note that as in all lexical definitions, whitespace is significant.
The prefix, if any, must be followed immediately by the quoted string content.
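
The negative lookaheads in ``stringcontent`` can be approximated with a regular expression. The following is a sketch, not the tokenizer's implementation: the names ``ITEM``, ``LONG_ITEM``, ``quoted`` and ``STRING`` are invented for this example, the triple-quoted alternatives are tried first because ``re`` alternation is ordered, and the ASCII-only restriction on bytes literals is ignored:

```python
import re

# stringitem: stringescapeseq | stringchar
# (a backslash followed by any character, or any character
# except backslash and newline)
ITEM = r"(?:\\[\s\S]|[^\\\n])"

# longstringitem: stringitem | newline
LONG_ITEM = rf"(?:{ITEM}|\n)"


def quoted(quote, item):
    """Build: quote ( !quote item )* quote"""
    q = re.escape(quote)
    return rf"{q}(?:(?!{q}){item})*{q}"


STRING = re.compile(
    r"(?i:br|rb|r|u|b)?"  # optional, case-insensitive prefix
    + "(?:"
    + "|".join([
        quoted("'''", LONG_ITEM),
        quoted('"""', LONG_ITEM),
        quoted("'", ITEM),
        quoted('"', ITEM),
    ])
    + ")"
)

samples = ["'abc'", r"'a\'b'", "'a'b'", "'''a'b\n'''", "'a\nb'", "Rb'x'", "r 'x'"]
for sample in samples:
    print(repr(sample), bool(STRING.fullmatch(sample)))
```

An unescaped quote ends a single-quoted literal, so ``'a'b'`` is rejected as a single literal, whereas a lone quote or a newline inside a triple-quoted literal is accepted. The last sample fails because the prefix must be followed immediately by the opening quote.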

Expand All @@ -692,7 +696,7 @@ The prefix, if any, must be followed immediately by the quoted string content.
.. _escape-sequences:

Escape sequences
^^^^^^^^^^^^^^^^
----------------

Unless an ``'r'`` or ``'R'`` prefix is present, escape sequences in string and
bytes literals are interpreted according to rules similar to those used by
Expand Down Expand Up @@ -985,7 +989,7 @@ and :meth:`str.format`, which uses a related format string mechanism.
.. _numbers:

Numeric literals
----------------
================

.. index:: number, numeric literal, integer literal
floating-point literal, hexadecimal literal
Expand Down Expand Up @@ -1241,14 +1245,26 @@ The following tokens serve as delimiters in the grammar:

( ) [ ] { }
, : ! . ; @ =

The period can also occur in floating-point and imaginary literals.

.. _lexical-ellipsis:

A sequence of three periods has a special meaning as an
:py:data:`Ellipsis` literal:

.. code-block:: none

...
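
The different roles of the period can be checked with the standard ``tokenize`` module. This is an illustrative sketch: the helper ``exact_kinds`` and the sample line are hypothetical, and the line only needs to tokenize, not to make sense as a program:

```python
import io
import token
import tokenize


def exact_kinds(source):
    """Map each token's text to the name of its exact token type."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return {tok.string: token.tok_name[tok.exact_type] for tok in tokens}


kinds = exact_kinds("a.b += 1.5 if c is ... else 0\n")
print(kinds["."], kinds["1.5"], kinds["..."], kinds["+="])

# At run time, the three-period token evaluates to the Ellipsis singleton.
value = ...
print(value is Ellipsis)
```

The period in ``a.b`` is a delimiter, the period in ``1.5`` is part of a numeric literal, and ``...`` is a single token of its own.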

The following *augmented assignment operators* serve
lexically as delimiters, but also perform an operation:

.. code-block:: none

-> += -= *= /= //= %=
@= &= |= ^= >>= <<= **=

The period can also occur in floating-point and imaginary literals. A sequence
of three periods has a special meaning as an ellipsis literal. The second half
of the list, the augmented assignment operators, serve lexically as delimiters,
but also perform an operation.

The following printing ASCII characters have special meaning as part of other
tokens or are otherwise significant to the lexical analyzer:
