csstree | CSS Ecosystem Directory


csstree - 1.0.0-alpha.36

Published by lahmatiy about 5 years ago

Dropped support for Node < 8
Updated dev deps (fixed npm audit issues)
Reworked build pipeline
- Package provides dist/csstree.js and dist/csstree.min.js now (instead of single dist/csstree.js that was a min version)
- Bundle size (min version) reduced from 191Kb to 158Kb due to some optimisations
Definition syntax
- Renamed grammar to definitionSyntax (named per spec)
- Added compact option to generate() method to avoid formatting (spaces) when possible
Lexer
- Changed dump() method to produce syntaxes in compact form by default

csstree - 1.0.0-alpha.35 Improvements on walkers

Published by lahmatiy about 5 years ago

Walker
- Changed implementation to avoid runtime compilation due to CSP issues (see #91, #109)
- Added find(), findLast() and findAll() methods (e.g. csstree.find(ast, node => node.type === 'ClassSelector'))

csstree - v1.0.0-alpha.31 Syntax matching improvements and fixes

Published by lahmatiy about 5 years ago

This release improves syntax matching with new features and several fixes.

Bracketed range notation

A couple of months ago, bracketed range notation was added to the Values and Units spec. The notation restricts numeric values to a range. For example, <integer [0,∞]> matches non-negative integers, and <number [0,1]> can be used for an alpha value.

Since the notation is a new thing in syntax definitions, it isn't used in specs yet. However, there is a PR (https://github.com/w3c/csswg-drafts/pull/3894) that will bring it to some specs, and CSSTree is ready for this.

For now, the notation made it possible to remove <number-zero-one>, <number-one-or-greater> and <positive-integer> from generic types and define them using regular grammar instead.
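To make the idea of a bracketed range concrete, here is a minimal sketch of checking a number against a range such as [0,1] or [0,∞]. The helpers parseRange() and inRange() are hypothetical names introduced for illustration; they are not part of CSSTree's API and say nothing about how its lexer implements the check.

```javascript
// Hypothetical helpers for illustration only; not part of CSSTree's API.
// Parses a bracketed range such as "[0,1]" or "[0,∞]" into numeric bounds.
function parseRange(notation) {
    const m = /^\[\s*([^,\s]+)\s*,\s*([^\]\s]+)\s*\]$/.exec(notation);
    if (m === null) {
        throw new SyntaxError('Bad range notation: ' + notation);
    }
    const toBound = (text) => {
        if (text === '∞' || text === '+∞') return Infinity;
        if (text === '-∞') return -Infinity;
        const n = Number(text);
        if (Number.isNaN(n)) throw new SyntaxError('Bad bound: ' + text);
        return n;
    };
    return { min: toBound(m[1]), max: toBound(m[2]) };
}

// Returns true when the value lies inside the (inclusive) range.
function inRange(value, notation) {
    const { min, max } = parseRange(notation);
    return value >= min && value <= max;
}

console.log(inRange(0.5, '[0,1]')); // true
console.log(inRange(1.5, '[0,1]')); // false
console.log(inRange(42, '[0,∞]'));  // true
console.log(inRange(-1, '[0,∞]'));  // false
```

Both bounds are inclusive here, which is why 0 is accepted by [0,∞].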

Low priority type matching

There are at least two productions that have a low priority in matching. Such a production gives other productions a chance to claim a token first, and claims the token itself only if no other production does. This release introduces a solution for such productions. It's hardcoded at the moment, but can be exposed if needed (i.e. if there are more such productions).

First production is <custom-ident>. The Values and Units spec states:

When parsing positionally-ambiguous keywords in a property value, a <custom-ident> production can only claim the keyword if no other unfulfilled production can claim it.

This rule applies to properties like <'animation'>, <'transition'> and <'list-style'>. Previously these were handled in different ways:

<'animation'> – not an issue, since <custom-ident> goes last; however, the order of terms may change in the future
<'transition'> – there was a patch for <single-transition> that changed the order of terms
<'list-style'> – had no fixes and just didn't work in some cases (see #101)

Now all of these, and all other syntaxes, work as expected.

Second production is a bit tricky. It's about "unitless zero" for <length> production. The spec states:

... if a 0 could be parsed as either a <number> or a <length> in a property (such as line-height), it must parse as a <number>.

This rule applies to properties like <'line-height'> or <'flex'>, and now it works per spec too.
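The "claim only if nobody else can" rule for both productions can be sketched as follows. This is a simplified model under stated assumptions: the production lists, the lowPriority flag and claimToken() are hypothetical and exist only for this illustration; CSSTree's lexer handles these cases internally and does not expose productions in this form.

```javascript
// Hypothetical model, for illustration only.
const NUMBER = /^[+-]?(\d+\.?\d*|\.\d+)$/;
const DIMENSION = /^[+-]?(\d+\.?\d*|\.\d+)(px|em|rem|pt)$/;

// Productions competing for a token in a line-height-like property.
const lineHeightLike = [
    { name: '<number>', test: (token) => NUMBER.test(token) },
    // <length> may claim a unitless zero, but only with low priority.
    { name: '<length>', lowPriority: true,
      test: (token) => token === '0' || DIMENSION.test(token) }
];

// Productions competing for a keyword in a list-style-like property.
const listStyleLike = [
    { name: '<custom-ident>', lowPriority: true,
      test: (token) => /^[a-z-]+$/i.test(token) },
    { name: 'none', test: (token) => token === 'none' }
];

// Returns the name of the production that claims the token, or null.
function claimToken(token, productions) {
    const candidates = productions.filter((p) => p.test(token));
    // Regular productions win; a low priority one claims the token
    // only when no other production can.
    const winner = candidates.find((p) => !p.lowPriority) || candidates[0];
    return winner ? winner.name : null;
}

console.log(claimToken('0', lineHeightLike));          // <number>
console.log(claimToken('10px', lineHeightLike));       // <length>
console.log(claimToken('none', listStyleLike));        // none
console.log(claimToken('my-counter', listStyleLike));  // <custom-ident>
```

Note that the result does not depend on the order of productions in the list, which is the point of the rule: <custom-ident> no longer has to be placed last for a syntax to work.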

Changes

Bumped mdn/data to 2.0.4 (#99)
Lexer
- Added bracketed range notation support and related refactoring
- Removed <number-zero-one>, <number-one-or-greater> and <positive-integer> from generic types. The types moved to the patch, since they can now be expressed in regular grammar using bracketed range notation
- Added support for multiple token string matching
- Improved <custom-ident> production matching to claim the keyword only if no other unfulfilled production can claim it (#101)
- Improved <length> production matching to claim "unitless zero" only if no other unfulfilled production can claim it
- Changed lexer's constructor to prevent generic types override when used
- Fixed matching for large ||- and &&-groups: matching now continues from the beginning when a term matches (#85)
- Fixed the check for var() occurrences when a value is a string (such values currently can't be matched against a syntax and fail with a specific error, which validation tools can use to ignore them)
- Fixed <declaration-value> and <any-value> matching when a value contains a function, parentheses or braces

csstree - 1.0.0-alpha.34

Published by lahmatiy about 5 years ago

Tokenizer
- Added isBOM() function
- Added charCodeCategory() function
- Removed firstCharOffset() function (use isBOM() instead)
- Removed CHARCODE dictionary
- Removed INPUT_STREAM_CODE* dictionaries
Lexer
- Allowed comments in matching value (just ignore them like whitespaces)
- Increased iteration count in value matching from 10k up to 15k
- Removed a forgotten debugger statement (#104)

csstree - 1.0.0-alpha.33 Yet another fix

Published by lahmatiy over 5 years ago

Lexer
- Fixed low priority productions matching by switching to a more robust approach (#103)

csstree - 1.0.0-alpha.32 Hot fix

Published by lahmatiy over 5 years ago

Changes

Lexer
- Fixed low priority productions matching in long ||- and &&- groups (#103)

csstree - 1.0.0-alpha.30 Tokens rule: reworked tokenizer, syntax matching switched to rely on tokens

Published by lahmatiy over 5 years ago

This release took too long to ship. But it was worth the wait, because it unlocks new possibilities and ways for further improvements.

Reworked tokenizer

CSSTree aims to stay as close to the specifications as is reasonable. It still deviates from the specs in places, because specs generally target user agents (browsers) rather than source processing tools like CSSTree.

Previously CSSTree's tokenizer used its own token type set, which was selected for better performance and to be convenient enough for building an AST. However, this restricted further improvement of the parser, lexer and even generator, since the basis of CSS is tokens. That's not obvious at first glance, but if you dig deep into the specs you'll find that CSS syntax is described in terms of tokens and their productions, serialization relies on tokens, even var() substitution takes place at the token level, and so on. Using a custom token type set means that many rules described in CSS specs can't be implemented as designed. That's why CSSTree's tokenizer was previously too far from the specs.

In this release the tokenizer was reworked to use the token type set defined by CSS Syntax Module Level 3. The algorithms described by the spec were adopted by the tokenizer implementation, and the code is annotated with excerpts from the specification. This brought the tokenizer very close to the spec and helped to fix numerous edge cases.

Current deviations from the CSS Syntax Module Level 3:

No input preprocessing currently. It's not a problem in practice, since CSS processing tools usually do not do any preprocessing. However, it can be added later via an additional option for the tokenizer and parser.
No comment removal. According to the spec, the tokenizer should not produce tokens for comments or otherwise preserve them in any way. But comments are useful for source processing tools, so it looks reasonable to keep them as comment tokens. This will probably change in the future.

Influence on parser

Changing the token type set led to a significant alteration of the parser implementation. The most dramatic changes are in the AnPlusB and UnicodeRange implementations, because those two microsyntaxes are really hard. Nevertheless, most things became simpler in general. The parser also continues to relax checks at the parse stage, delegating more syntax checking to the lexer. As a result, some parsing errors no longer occur, so tools using CSSTree have a chance to use the AST even for partially invalid CSS.

This release doesn't change the AST format. However, the format will certainly change in upcoming releases to be closer to the token type set. That will further reduce parse errors and give tools more possibilities.

Lexer

The lexer was slightly refactored. The most significant change is that syntax matching now relies on real CSS tokens produced by the tokenizer rather than on tokens generated from AST nodes. In other words, the AST is translated to a string, which is then split into tokens by the tokenizer. Consequences of this:

Since the AST is not used directly for producing tokens and syntax matching, it became completely optional.
A string can be used as a value for matching (i.e. lexer.matchProperty('border', 'red 1px dotted')). So parsing into an AST is not required anymore, which is good news for tools that use CSSTree for validation and have another AST format or no AST at all.
Types that use tokens in their syntax can now be used for matching. Such syntaxes were previously omitted from mdn/data by CSSTree's patch. Fortunately, this is no longer needed (difference with mdn/data).

Work on the lexer is not complete yet. This version removes some restrictions, and it's ready for further improvements like at-rules and selectors matching, better mathematical expressions (calc() and friends) support, attr()/toggle()/var() fallback checking, multiple errors, suggestions, improving matching performance and so on.

Change log (commits)

Bumped mdn/data to ~2.0.3
- Stopped removing types from mdn/data, which was previously needed due to the lack of some generic types and specific lexer restrictions (the lexer was reworked, see below)
- Reduced and updated patches
Tokenizer
- Reworked tokenizer itself to comply with CSS Syntax Module Level 3
- Tokenizer class split into several abstractions:
  - Added TokenStream class
  - Added OffsetToLocation class
  - Added tokenize() function that creates TokenStream instance for given string or updates a TokenStream instance passed as second parameter
  - Removed Tokenizer class
- Removed Raw token type
- Renamed Identifier token type to Ident
- Added token types: Hash, BadString, BadUrl, Delim, Percentage, Dimension, Colon, Semicolon, Comma, LeftSquareBracket, RightSquareBracket, LeftParenthesis, RightParenthesis, LeftCurlyBracket, RightCurlyBracket
- Replaced Punctuator with Delim token type, that excludes specific characters with its own token type like Colon, Semicolon etc
- Removed findCommentEnd, findStringEnd, findDecimalNumberEnd, findNumberEnd, findEscapeEnd, findIdentifierEnd and findUrlRawEnd helper functions
- Removed SYMBOL_TYPE, PUNCTUATION and STOP_URL_RAW dictionaries
- Added isDigit, isHexDigit, isUppercaseLetter, isLowercaseLetter, isLetter, isNonAscii, isNameStart, isName, isNonPrintable, isNewline, isWhiteSpace, isValidEscape, isIdentifierStart, isNumberStart, consumeEscaped, consumeName, consumeNumber and consumeBadUrlRemnants helper functions
Parser
- Changed parsing algorithms to work with new token type set
- Changed HexColor consumption to relax value checking, i.e. a value is now a sequence of one or more name chars
- Added & as a property hack
- Relaxed var() parsing to only check that the first argument is an identifier (not a custom property name as before)
Lexer
- Reworked syntax matching to rely on the token set only (having an AST is optional now)
- Extended Lexer#match(), Lexer#matchType() and Lexer#matchProperty() methods to take a string as a value, besides an AST
- Extended Lexer#match() method to take a string as a syntax, besides a syntax descriptor
- Reworked generic types:
  - Removed <attr()>, <url> (moved to patch) and <progid> types
  - Added types:
    - Related to token types: <ident-token>, <function-token>, <at-keyword-token>, <hash-token>, <string-token>, <bad-string-token>, <url-token>, <bad-url-token>, <delim-token>, <number-token>, <percentage-token>, <dimension-token>, <whitespace-token>, <CDO-token>, <CDC-token>, <colon-token>, <semicolon-token>, <comma-token>, <[-token>, <]-token>, <(-token>, <)-token>, <{-token> and <}-token>
    - Complex types: <an-plus-b>, <urange>, <custom-property-name>, <declaration-value>, <any-value> and <zero>
  - Renamed <unicode-range> to <urange> as per spec
  - Renamed <expression> (IE legacy extension) to <-ms-legacy-expression>; it may be removed in upcoming releases

csstree - 1.0.0-alpha.29 New syntax matching approach

Published by lahmatiy over 6 years ago

A brand new syntax matching

This release brings a brand new syntax matching approach. Syntax matching is an important feature that allows CSSTree to provide the meaning of each component in a declaration value, e.g. which component of a declaration value is a color, a length and so on. You can see an example of a matching result on CSSTree's syntax reference page.

Syntax matching is now based on CSS tokens and uses a state machine approach, which fixes all the problems it had before (see https://github.com/csstree/csstree/issues/67 for the list of issues).

Token-based matching

Previously syntax matching was based on AST nodes. Although syntax matching can be done this way, it has several disadvantages:

Synchronising traversal of the CSS parsing result (AST) and the syntax description tree is quite complicated:
- Each tree represents different things: one node type set for the CSS parsing result and another for the syntax description tree
- Some AST nodes consist of several tokens and contain child nodes
Some AST nodes don't contain symbols that appear in the output when the AST is translated to a string. For instance, a Function node contains a function name and a list of children, but it also produces parentheses that aren't stored in the AST. This introduced many hacks and workarounds. Even that was not enough, since the approach doesn't work for nodes like Brackets. It also forces the matching algorithm to know a lot about node types and their features.

Starting with this release, the AST (CSS parse result) is converted to a token stream before matching (using CSSTree's generator with a special decorator function). The syntax description tree is also converted into a so-called Match graph (see details below). Those transformations align both trees so that they work in the same terms: CSS tokens.

This change makes the matching algorithm much simpler. Now it knows nothing about the AST structure, and hacks and workarounds were removed. Moreover, syntaxes like <line-names> (contains brackets) and <calc()> (contains operators in nested syntaxes) can now be matched (previously syntax matching failed for them).

Update syntax AST format

Since syntax matching moved from AST nodes to CSS tokens, the syntax description tree format was also changed. For instance, a function is now represented as a token sequence. This makes it possible to handle syntaxes that contain a group with several function tokens inside, like this one:

<color-adjuster> =
    [red( | green( | blue( | alpha( | a(] ['+' | '-']? [<number> | <percentage>] ) |
    [red( | green( | blue( | alpha( | a(] '*' <percentage> ) |
    ...

Although the <color-mod()> syntax was recently removed from CSS Color Module Level 4, such syntaxes can appear in the future, since they are valid (even if they look odd).

As a result of the format changes, all syntaxes in mdn/data can now be parsed, even syntaxes that are invalid from the standpoint of the CSS Values and Units Module Level 3 spec. Due to this, some errors in syntaxes were found and fixed (https://github.com/mdn/data/pull/221, https://github.com/mdn/data/pull/226). Also some suggestions on syntax optimisation were made (https://github.com/mdn/data/pull/223, https://github.com/mdn/data/issues/230).

Introducing Match graph

As mentioned above, the syntax tree is now transformed into a Match graph. This happens on the first match for a syntax, and the graph is then reused. A Match graph represents a graph of simple actions (states) and transitions between them. Some complicated things, like multipliers, are translated into a set of nodes and edges. You can explore the match graph built for any syntax on CSSTree's syntax reference page, e.g. the match graph for <'animation-name'>.

There were some challenges during implementation, the most notable of them:

&&- and ||- groups. This was actually a technical blocker that held up the move to a match graph. Finally, a solution was found: split a group into smaller ones by removing terms one by one. For example, a && b && c can be represented as follows (pseudo code):

if match a
  then [b && c]
  else if match b
    then [a && c]
    else if match c
      then [a && b]
      else MISMATCH

So the size of a group is reduced by one on each step, and the smaller groups are then processed the same way until a group consists of a single term. For example, a && b fully expands to:

a && b
=
if match a
  then if match b
    then MATCH
    else MISMATCH
  else if match b
    then if match a
      then MATCH
      else MISMATCH
    else MISMATCH

It works fine, but only for small groups, since it produces at least N! (factorial) nodes, where N is the number of terms in a group. Fortunately, not many syntaxes contain a &&- or ||- group with a large number of terms. However, the font-variant syntax contains a group of 20 terms, which means at least 2,432,902,008,176,640,000 nodes in a graph. That's huge, and we can't create that many objects due to memory limits. So an alternative solution was introduced for groups with more than 5 terms: it uses a special buffer and iterates over terms in a loop. The solution is not ideal, but there are just 9 such groups (with 6 or more terms) across all syntaxes, so it should be OK for now.
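The expansion and its growth can be reproduced with a short sketch. The functions expandAllOf(), countNodes() and factorial() below are hypothetical names written for this illustration, and the node shape ({ match, then, else }) is an assumption that mirrors the pseudo code, not CSSTree's actual match graph format.

```javascript
// Illustration only: expands an &&-group into nested if/then/else nodes,
// following the term-by-term approach from the pseudo code.
function expandAllOf(terms) {
    // Nothing left to match: the whole group is matched.
    if (terms.length === 0) return 'MATCH';
    let node = 'MISMATCH';
    // Build the chain from the last term to the first,
    // so that the first term is tested first.
    for (let i = terms.length - 1; i >= 0; i--) {
        const rest = terms.filter((_, j) => j !== i);
        node = { match: terms[i], then: expandAllOf(rest), else: node };
    }
    return node;
}

// Counts "match" nodes in an expanded tree.
function countNodes(node) {
    if (typeof node === 'string') return 0;
    return 1 + countNodes(node.then) + countNodes(node.else);
}

// N! as a BigInt, to stay exact for N = 20.
function factorial(n) {
    let result = 1n;
    for (let i = 2n; i <= BigInt(n); i++) result *= i;
    return result;
}

console.log(countNodes(expandAllOf(['a', 'b'])));           // 4
console.log(countNodes(expandAllOf(['a', 'b', 'c'])));      // 15
console.log(countNodes(expandAllOf(['a', 'b', 'c', 'd']))); // 64
console.log(factorial(20)); // 2432902008176640000n
```

The node counts (4, 15, 64) stay above N! (2, 6, 24) for each group size, which is consistent with the "at least N!" estimate and shows why the expansion is only practical for small groups.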

A comma. The task turned out to be a tough nut to crack, because of specific rules. For example, if we have a syntax like that:

a?, b?, c?

We can match "a, b, c", "a, c", "b", "b, c" and so on. But input like ", b, c", "a, , c" or "a," is not allowed. In other words, a comma must not be left hanging and must not be followed by another comma. And when a comma is matched against the input, it should report a positive match even if there is no comma token in the input. This was a blocker that could have cancelled the whole approach.

Nevertheless, the problem was solved in an elegant way, by checking adjacent tokens for several patterns. It is the most non-trivial part of the new syntax matching: several lines of code that work only together with other parts of the implementation, so it may look like magic.
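The comma rules for a syntax like a?, b?, c? can be sketched with a simplified matcher. This is not CSSTree's adjacent-token check; matchOptionalCommaList() is a hypothetical helper that only handles a flat list of optional keywords, to show which inputs should pass and which should fail.

```javascript
// Illustration only: matches tokens against a comma-separated list of
// optional keywords (e.g. a?, b?, c?), in order.
function matchOptionalCommaList(terms, tokens) {
    let pos = 0;            // current position in the token list
    let needComma = false;  // true once some term has been matched
    for (const term of terms) {
        if (needComma) {
            // A comma is consumed only together with the term that follows,
            // so hanging and doubled commas are never accepted.
            if (tokens[pos] === ',' && tokens[pos + 1] === term) {
                pos += 2;
            }
        } else if (tokens[pos] === term) {
            pos += 1;
            needComma = true;
        }
    }
    // The match succeeds only if every token was consumed.
    return pos === tokens.length;
}

const terms = ['a', 'b', 'c'];
console.log(matchOptionalCommaList(terms, ['a', ',', 'b', ',', 'c'])); // true
console.log(matchOptionalCommaList(terms, ['a', ',', 'c']));           // true
console.log(matchOptionalCommaList(terms, ['b']));                     // true
console.log(matchOptionalCommaList(terms, [',', 'b', ',', 'c']));      // false
console.log(matchOptionalCommaList(terms, ['a', ',', ',', 'c']));      // false
console.log(matchOptionalCommaList(terms, ['a', ',']));                // false
```

When a term is skipped, its comma is skipped too, which corresponds to the comma in the syntax "matching" even though there is no comma token in the input.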

Using state machine

Another improvement in syntax matching is replacing the recursion-based algorithm with a state machine approach. This makes it possible to check all possible alternatives during syntax matching. Previously, if nothing matched on the chosen path, the algorithm just exited with a mismatch result. The new algorithm returns to a branching point and chooses an alternative path when possible. This fixes the following:

Syntaxes with alternative paths, like <bg-position>.

<bg-position> =
    [ left | center | right | top | bottom | <length-percentage> ] |
    [ left | center | right | <length-percentage> ] [ top | center | bottom | <length-percentage> ] |
    [ center | [ left | right ] <length-percentage>? ] && [ center | [ top | bottom ] <length-percentage>? ]

This syntax didn't work before, since it defines the shortest form first and matching fell into this path with no chance to use an alternative path. However, reversing the order of groups in this syntax made it work with the old algorithm.

Another example is a new syntax for <rgb()>:

rgb() = rgb( <percentage>{3} [ / <alpha-value> ]? ) |
        rgb( <number>{3} [ / <alpha-value> ]? ) |
        rgb( <percentage>#{3} , <alpha-value>? ) |
        rgb( <number>#{3} , <alpha-value>? )

The old algorithm didn't exit from a function's content once it matched a function, so it couldn't handle such syntaxes. To make matching work for syntaxes like this one, an adaptation was required (a patch as a workaround). Now patches are not required.

Matching for syntaxes not compatible with greedy algorithms. For instance, the syntax of composes (CSS Modules) is defined as <custom-ident>+ from <string>, and the old matching algorithm failed on it because from is a valid value for <custom-ident>, so it was captured by <custom-ident>+ with no alternatives. The new algorithm is not greedy: on the first try it takes the minimum count of tokens allowed by a syntax, and increases that count if possible on each return to the branching point. Syntaxes like composes can be matched now as well.
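The difference between the greedy and the non-greedy strategy for <custom-ident>+ from <string> can be sketched as follows. The token tests and both matcher functions are hypothetical simplifications written for this illustration; they work on an array of already split tokens and are not CSSTree's state machine.

```javascript
// Simplified token tests, for illustration only.
function isIdent(token) {
    return /^[a-zA-Z_-][\w-]*$/.test(token);
}
function isString(token) {
    return /^(["']).*\1$/.test(token);
}

// Greedy: <custom-ident>+ takes every identifier it can, including "from".
// Returns the number of tokens claimed by <custom-ident>+, or -1 on mismatch.
function matchComposesGreedy(tokens) {
    let count = 0;
    while (count < tokens.length && isIdent(tokens[count])) count++;
    const restMatches = tokens.length === count + 2 &&
        tokens[count] === 'from' && isString(tokens[count + 1]);
    return count >= 1 && restMatches ? count : -1;
}

// Non-greedy: start with the minimum count and grow it on each retry.
function matchComposes(tokens) {
    for (let count = 1; count <= tokens.length; count++) {
        // The run of identifiers can't be extended any further.
        if (!isIdent(tokens[count - 1])) return -1;
        // Try to match the rest of the syntax with the current count.
        if (tokens.length === count + 2 &&
            tokens[count] === 'from' && isString(tokens[count + 1])) {
            return count;
        }
    }
    return -1;
}

const value = ['a', 'b', 'from', '"./x.css"'];
console.log(matchComposesGreedy(value)); // -1
console.log(matchComposes(value));       // 2
console.log(matchComposes(['from', 'from', '"./x.css"'])); // 1
```

The last call shows why the ambiguity is real: the first from is a valid <custom-ident>, and only the second one is the keyword.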

A state machine approach gives some other benefits, like precise error locations. Previously, the location of a problem could be confusing:

SyntaxMatchError: Mismatch
  syntax: ...
   value: rgb(1,2)
  ------------^

And now it's more helpful:

SyntaxMatchError: Mismatch
  syntax: ...
   value: rgb(1,2)
  ---------------^

Further improvements on syntax matching can improve error handling and probably provide some sort of suggestions.

Performance

The new syntax matching approach requires more memory and time, because of the AST to token stream transformation and checking all possible alternatives. However, the new approach is more efficient in itself and has room for further optimisations. It usually takes the same or ~50% more time (depending on the syntax and the value being matched) compared with the previous algorithm. So that's not a big deal.

The main goal of the release was to make it all work, so not every possible optimisation was implemented; more will come in upcoming releases.

Other changes

Lexer
- Syntax matching was completely reworked. Now it's token-based and uses a state machine. Public API has not changed. However, some internal data structures have changed. The most significant change is in the syntax match result tree structure: it became token-based instead of node-based.
- Grammar
  - Changed grammar tree format:
    - Added Token node type to represent a single code point (<delim-token>)
    - Added Multiplier that wraps a single node (term property)
    - Added AtKeyword to represent <at-keyword-token>
    - Removed Slash and Percent node types, they are replaced by a node with Token type
    - Changed Function to represent <function-token> with no children
    - Removed multiplier property from Group
  - Changed generate() method:
    - Method takes options as a second argument now (generate(node, forceBraces, decorator) -> generate(node, options)). Two options are supported: forceBraces and decorate
    - When the second parameter is a function, it is treated as the decorate option value, i.e. generate(node, fn) -> generate(node, { decorate: fn })
    - The decorate function is invoked with an additional parameter – a reference to a node
Tokenizer
- Renamed Atrule const to AtKeyword

csstree - 1.0.0-alpha.28 Fixes

Published by lahmatiy over 6 years ago

Renamed lexer.grammar.translate() method into generate()
Fixed <'-webkit-font-smoothing'> and <'-moz-osx-font-smoothing'> syntaxes (#75)
Added vendor keywords for <'overflow'> property syntax (#76)
Pinned mdn-data to ~1.1.0 and fixed issues with some updated property syntaxes

csstree - 1.0.0-alpha.27 Rework generator and walker

Published by lahmatiy almost 7 years ago

Most of the changes in this release relate to the rework of the generator and walker. Instead of plenty of methods, there is now a single method for each: generate() for the generator and walk() for the walker. Both methods take two arguments, ast and options (optional for the generator). This makes the API much simpler (see details about the API in Translate AST to string and AST traversal).

Also, the List class API was extended, and some utility methods such as keyword() and property() were changed to be more useful.

Generator

Changed invocation of a node's generate() methods: methods now take a node as a single argument and a context (i.e. this) that has the methods chunk(), node() and children()
Renamed translate() to generate() and changed to take options argument
Removed translateMarkup(ast, enter, leave) method, use generate(ast, { decorator: (handlers) => { ... }}) instead
Removed translateWithSourceMap(ast), use generate(ast, { sourceMap: true }) instead
Changed to support children as an array

Walker

Changed walk() to take an options argument instead of handler, with enter, leave, visit and reverse options (walk(ast, fn) still works and is equivalent to walk(ast, { enter: fn }))
Removed walkUp(ast, fn), use walk(ast, { leave: fn }) instead
Removed walkRules(ast, fn), use walk(ast, { visit: 'Rule', enter: fn }) instead
Removed walkRulesRight(ast, fn), use walk(ast, { visit: 'Rule', reverse: true, enter: fn }) instead
Removed walkDeclarations(ast, fn), use walk(ast, { visit: 'Declaration', enter: fn }) instead
Changed to support children as an array in most cases (reverse: true will fail on arrays since they have no forEachRight() method)

Misc

List
- Added List#forEach() method
- Added List#forEachRight() method
- Added List#filter() method
- Changed List#map() method to return a List instance instead of Array
- Added List#push() method, similar to List#appendData() but returns nothing
- Added List#pop() method
- Added List#unshift() method, similar to List#prependData() but returns nothing
- Added List#shift() method
- Added List#prependList() method
- Changed List#insert(), List#insertData(), List#appendList() and List#insertList() methods to return the list on which the operation was performed
Changed keyword() method
- Changed name field to include a vendor prefix
- Added basename field to contain a name without a vendor prefix
- Added custom field that contains true when a keyword is a custom property reference
Changed property() method
- Changed name field to include a vendor prefix
- Added basename field to contain a name without any prefixes, i.e. a hack and a vendor prefix
Added vendorPrefix() method
Added isCustomProperty() method

csstree - v1.0.0-alpha.26 Tolerant parsing by default

Published by lahmatiy almost 7 years ago

This journey started a couple of months ago with 1.0.0-alpha20, which added a tolerant parsing mode as an experimental feature, available behind the tolerant option. Over 5 releases, the feature was tested on various data, and numerous errors and edge cases were fixed. The last necessary changes were made in this release, which makes the feature ready for use. So, I'm proud to say, the CSSTree parser is tolerant to errors by default now.

That's a significant change, and it meets CSS Syntax Module Level 3, which says:

When errors occur in CSS, the parser attempts to recover gracefully, throwing away only the minimum amount of content before returning to parsing as normal. This is because errors aren’t always mistakes - new syntax looks like an error to an old parser, and it’s useful to be able to add new syntax to the language without worrying about stylesheets that include it being completely broken in older UAs.

In other words, a spec-compliant CSS parser should be able to parse any text as CSS with no errors. CSSTree is now such a parser! 🎉

The only way the CSSTree parser departs from the specification is that it doesn't throw away bad content, but wraps it in Raw nodes, which allows processing it later. This discrepancy is due to the fact that the specification is written for UAs that extract meaning from CSS, so incomprehensible parts simply make no sense to them and can be ignored. CSSTree has a wider range of tasks, and most of them are related to the processing of the source code. These are tasks such as locating errors, error correction, preprocessing, and so on.

Tolerant mode means you don't need to wrap csstree.parse() in try/catch. To collect parse errors, set an onParseError handler in the parse options:

var csstree = require('css-tree');

csstree.parse('I must! be tolerant to errors', {
    onParseError: function(e) {
        console.error(e.formattedMessage);
    }
});
// Parse error: Unexpected input
//     1 |I must! be tolerant to errors
// -------------^
// Parse error: LeftCurlyBracket is expected
//     1 |I must! be tolerant to errors
// ------------------------------------^

If you need the old parser behaviour, just throw an exception inside the onParseError handler, which immediately stops parsing:

try {
    csstree.parse('I must! be tolerant to errors', {
        onParseError: function(e) {
            throw e;
        }
    });
} catch(e) {
    console.error(e.formattedMessage);
}
// Parse error: Unexpected input
//     1 |I must! be tolerant to errors
// -------------^