/*-------------------------------------------------------------------------
 *
 * scanner.h
 *		API for the core scanner (flex machine)
 *
 * The core scanner is also used by PL/pgSQL, so we provide a public API
 * for it.  However, the rest of the backend is only expected to use the
 * higher-level API provided by parser.h.
 *
 *
 * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 * src/include/parser/scanner.h
 *
 *-------------------------------------------------------------------------
 */
#ifndef SCANNER_H
#define SCANNER_H

#include "common/keywords.h"

/*
 * The scanner returns extra data about scanned tokens in this union type.
 * Note that this is a subset of the fields used in YYSTYPE of the bison
 * parsers built atop the scanner.
 */
typedef union core_YYSTYPE
{
	int			ival;			/* for integer literals */
	char	   *str;			/* for identifiers and non-integer literals */
	const char *keyword;		/* canonical spelling of keywords */
} core_YYSTYPE;

/*
 * We track token locations in terms of byte offsets from the start of the
 * source string, not the column number/line number representation that
 * bison uses by default.  Also, to minimize overhead we track only one
 * location (usually the first token location) for each construct, not
 * the beginning and ending locations as bison does by default.  It's
 * therefore sufficient to make YYLTYPE an int.
 */
#define YYLTYPE  int

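/*
 * As an illustration (a sketch of the idea, not the authoritative code):
 * the scanner can derive a token's location cheaply by pointer arithmetic
 * against the scan buffer,
 *
 *		YYLTYPE		loc = yytext - yyextra->scanbuf;
 *
 * and that byte offset can later be handed to scanner_errposition() to
 * report an error cursor at the token's position.
 */
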
/*
 * Another important component of the scanner's API is the token code
 * numbers.  However, those are not defined in this file, because bison
 * insists on defining them for itself.  The token codes used by the core
 * scanner are the ASCII characters plus these:
 *	%token <str>	IDENT UIDENT FCONST SCONST USCONST BCONST XCONST Op
 *	%token <ival>	ICONST PARAM
 *	%token			TYPECAST DOT_DOT COLON_EQUALS EQUALS_GREATER
 *	%token			LESS_EQUALS GREATER_EQUALS NOT_EQUALS
 *
 * The above token definitions *must* be the first ones declared in any
 * bison parser built atop this scanner, so that they will have consistent
 * numbers assigned to them (specifically, IDENT = 258 and so on).
 */

/*
 * The YY_EXTRA data that a flex scanner allows us to pass around.
 * Private state needed by the core scanner goes here.  Note that the actual
 * yy_extra struct may be larger and have this as its first component, thus
 * allowing the calling parser to keep some fields of its own in YY_EXTRA.
 */
typedef struct core_yy_extra_type
{
	/*
	 * The string the scanner is physically scanning.  We keep this mainly so
	 * that we can cheaply compute the offset of the current token (yytext).
	 */
	char	   *scanbuf;
	Size		scanbuflen;

	/*
	 * The keyword list to use, and the associated grammar token codes.
	 */
	const ScanKeywordList *keywordlist;
	const uint16 *keyword_tokens;

	/*
	 * Scanner settings to use.  These are initialized from the corresponding
	 * GUC variables by scanner_init().  Callers can modify them after
	 * scanner_init() if they don't want the scanner's behavior to follow the
	 * prevailing GUC settings.
	 */
	int			backslash_quote;
	bool		escape_string_warning;
	bool		standard_conforming_strings;

	/*
	 * literalbuf is used to accumulate literal values when multiple rules are
	 * needed to parse a single literal.  Call startlit() to reset buffer to
	 * empty, addlit() to add text.  NOTE: the string in literalbuf is NOT
	 * necessarily null-terminated, but there always IS room to add a trailing
	 * null at offset literallen.  We store a null only when we need it.
	 */
	char	   *literalbuf;		/* palloc'd expandable buffer */
	int			literallen;		/* actual current string length */
	int			literalalloc;	/* current allocated buffer size */

	int			state_before_str_stop;	/* start cond. before end quote */
	int			xcdepth;		/* depth of nesting in slash-star comments */
	char	   *dolqstart;		/* current $foo$ quote start string */

	/* first part of UTF16 surrogate pair for Unicode escapes */
	int32		utf16_first_part;

	/* state variables for literal-lexing warnings */
	bool		warn_on_first_escape;
	bool		saw_non_ascii;
} core_yy_extra_type;

/*
 * The type of yyscanner is opaque outside scan.l.
 */
typedef void *core_yyscan_t;

/* Constant data exported from parser/scan.l */
extern PGDLLIMPORT const uint16 ScanKeywordTokens[];

/* Entry points in parser/scan.l */
extern core_yyscan_t scanner_init(const char *str,
								  core_yy_extra_type *yyext,
								  const ScanKeywordList *keywordlist,
								  const uint16 *keyword_tokens);
extern void scanner_finish(core_yyscan_t yyscanner);
extern int	core_yylex(core_YYSTYPE *lvalp, YYLTYPE *llocp,
					   core_yyscan_t yyscanner);
extern int	scanner_errposition(int location, core_yyscan_t yyscanner);
extern void scanner_yyerror(const char *message, core_yyscan_t yyscanner) pg_attribute_noreturn();
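
/*
 * Typical call sequence (a hedged sketch; see parser.c for the
 * authoritative usage pattern):
 *
 *		core_yy_extra_type yyext;
 *		core_YYSTYPE lval;
 *		YYLTYPE		lloc;
 *		int			tok;
 *
 *		core_yyscan_t sc = scanner_init(query_string, &yyext,
 *										&ScanKeywords, ScanKeywordTokens);
 *		while ((tok = core_yylex(&lval, &lloc, sc)) != 0)
 *			 ... process token ...
 *		scanner_finish(sc);
 */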

#endif							/* SCANNER_H */