Grammar Question
Published 20 years, 1 month past
I was just recently asked if attribute selectors must use quotes around the value. In other words, are both the following two selectors legal?
a[href="http://www.meyerweb.com"] {font-weight: bold;}
a[href=http://www.complexspiral.com] {font-style: italic;}
“No, they’re optional,” I said with assurance. And then the doubts started to gnaw at me. What if they actually weren’t, which might make sense given that you can require the exact match of a space-separated list of attribute values? By this, I mean that if you declare:
div[class="this is a test"] {color: orange;}
…then the selector will match any div element whose class attribute is exactly "this is a test", in that order, and with nothing else in the attribute value. Or so I’ve always been given to understand. In that case, if you left off the quotes, couldn’t that somehow be confusing to the browser? Maybe not, but it still bothered me.
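To make that distinction concrete, here is a minimal Python sketch of exact matching as described above, next to the space-separated-word matching of the ~= operator. The function names are made up for illustration, and the sketch ignores browser details such as case sensitivity.

```python
def matches_exact(attr_value: str, wanted: str) -> bool:
    # [class="this is a test"]: the whole attribute value must be identical.
    return attr_value == wanted


def matches_includes(attr_value: str, wanted: str) -> bool:
    # [class~="test"]: one of the whitespace-separated words must be identical.
    return wanted in attr_value.split()


print(matches_exact("this is a test", "this is a test"))      # True
print(matches_exact("a test this is", "this is a test"))      # False (wrong order)
print(matches_exact("this is a test too", "this is a test"))  # False (extra word)
print(matches_includes("this is a test", "test"))             # True
```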
So I went digging through CSS2.1, Appendix G and found the grammatical definition of an attribute selector.
attrib : '[' S* IDENT S* [ [ '=' | INCLUDES | DASHMATCH ] S* [ IDENT | STRING ] S* ]? ']' ;
You all understood that, right? Uh-huh. Me either. (This is one of those hostile-to-outsiders posts I mentioned a while back.)
I never liked grammar in school, and I still don’t, but it is sometimes sadly necessary. So here goes. When you run down the definitions of the all-caps words (I think those are tokens) you find that IDENT is an identifier, which is sort of a catch-all bin for things like selectors, property names, values, and such. Fine. STRING, on the other hand, is a collection of symbols and other fun stuff, again including the non-ASCII range but not the ASCII range. But then it includes the entirety of Unicode. I’m not sure how much sense that makes, but whatever.
So the whole point of this is: if quotes around an IDENT are optional, wouldn’t it have made more sense to say this?
attrib : '[' S* IDENT S* [ [ '=' | INCLUDES | DASHMATCH ] S* [ STRING? IDENT STRING? ] S* ]? ']' ;
Or even:
attrib : '[' S* IDENT S* [ [ '=' | INCLUDES | DASHMATCH ] S* [ '"'? IDENT '"'? ] S* ]? ']' ;
It’s the [ IDENT | STRING ] from the original definition that has me befuddled. It seems like it’s saying you can include an IDENT or a STRING but not both, and since IDENT doesn’t include quotes, that implies that you can either drop in an identifier, or a string with quotes. Why is this a good idea? Does it mean that any identifier has to be unquoted? Does it mean that there’s no practical distinction between a STRING and an IDENT in this situation? Does it mean that quotes prevent the inclusion of anything useful? Somebody let me know.
Comments (16)
Does it mean that quotes prevent the inclusion of anything useful?
Nope. “String” does in fact include useful characters. Here’s the maze you have to run to figure that out:
1) string {string1}|{string2}
2) string1 "([\t !#$%&(-~]|\{nl}|'|{nonascii}|{escape})*"
3) escape {unicode}|\[ -~\200-\377]
4) <space>-~ <-- this range includes all low-ascii characters that aren’t control characters, including letters, digits, lots of symbols, etc.
I noticed that the font that w3.org is using makes the ~ look like a -. I wouldn’t have realized what was going on except that it appears ever so slightly higher than the - it was sitting next to.
Hi Eric,
I hope I can help here a bit and perhaps thus pay back for some of the help I got from your books and columns.
STRING, in fact, *does* include the ASCII range (or most of it). The key to understanding this is that little “-” hiding there, near the end of [\t !#$%&(-~], in the definition of STRING. What it means is “any character whose code is between that of '(' (which is hex 28) and '~' (which is hex 7E)”. This includes all the digits, lower and upper case letters, and some more.
So if your string qualifies as an IDENT (roughly meaning that it starts with a letter or underscore, and does not contain unescaped spaces or punctuation), you don’t have to quote it. Any other string, you do have to quote.
Note that every IDENT also qualifies as a string (when quoted). So you’re always allowed to quote, if you feel like it.
At least that’s the way I understand it.
Oops! Correction to what I wrote above. “escape” requires a backslash to precede it. The way you get the useful characters in there is from string1 (or string2), where it lists (-~ in the jumble of symbols. In this case, “-” denotes a range of characters, not a literal minus sign. The range (-~ includes almost everything in <space>-~, just not ! " # $ % & and ', all of which (except the quote marks) are listed literally.
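A quick way to check that range arithmetic is to enumerate the code points. The following Python sketch (the helper name char_range is made up) compares the (-~ range with the full <space>-~ range and lists what falls in the gap.

```python
def char_range(first: str, last: str) -> set:
    # All characters whose code points lie between first and last, inclusive.
    return {chr(c) for c in range(ord(first), ord(last) + 1)}


space_to_tilde = char_range(" ", "~")  # hex 20 through hex 7E
paren_to_tilde = char_range("(", "~")  # hex 28 through hex 7E

# Characters in <space>-~ that the (-~ range does not cover.
missing = sorted(space_to_tilde - paren_to_tilde)
print(missing)  # [' ', '!', '"', '#', '$', '%', '&', "'"]

# Letters and digits all sit inside (-~.
print(all(ch in paren_to_tilde for ch in "abcXYZ0189"))  # True
```

The space shows up in the gap as well, but like the other symbols it is listed literally in the character class.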
Antone – I’m afraid you took the wrong turn in step 3 – you followed the rule for escaped characters (which consist of a backslash followed by something).
(the other reason I’m posting this comment is to correct my e-mail address, which I mis-typed last time).
I don’t think ‘div[class=this is a test] {color: orange;}’, without quotes, is legal.
attrib : '[' S* IDENT S* [ [ '=' | INCLUDES | DASHMATCH ] S* [ IDENT | STRING ] S* ]? ']'
After the ‘=’, you can have 0 or more whitespace characters followed by exactly one ident or string followed by 0 or more whitespace characters (notice that there’s no “*”, “+” or anything after [ IDENT | STRING ], so you have to have exactly one of those). If you put quotes around “this is a test”, you have one string. Without the quotes, you have 4 identifiers.
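That reading of the production can be tried out with a rough Python sketch. The IDENT and STRING patterns here are deliberately simplified assumptions (ASCII only, no backslash escapes), not the full token definitions from the spec, and is_valid_attrib is a made-up helper name.

```python
import re

# Simplified tokens: ASCII only, no backslash escapes (the real grammar allows both).
IDENT = r"-?[_a-zA-Z][_a-zA-Z0-9-]*"
STRING = r"\"[^\"\n\\]*\"|'[^'\n\\]*'"
S = r"[ \t\r\n\f]*"

# attrib : '[' S* IDENT S* [ [ '=' | INCLUDES | DASHMATCH ] S* [ IDENT | STRING ] S* ]? ']'
ATTRIB = re.compile(
    rf"\[{S}(?:{IDENT}){S}(?:(?:=|~=|\|=){S}(?:{IDENT}|{STRING}){S})?\]"
)


def is_valid_attrib(selector_part: str) -> bool:
    # True when the whole text matches the (simplified) attrib production.
    return ATTRIB.fullmatch(selector_part) is not None


print(is_valid_attrib('[class="this is a test"]'))  # True: one STRING
print(is_valid_attrib('[class=this is a test]'))    # False: four IDENTs
print(is_valid_attrib('[align=left]'))              # True: one IDENT
```

Under this simplified reading, the unquoted URL from the top of the post also fails, because ":" and "/" cannot appear in an identifier without escaping.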
Always one step behind you :-)
I should refresh more often.
Is there any particular reason you are quoting from CSS 2.1 and not CSS 2? The definition of attr is the same, and the 2.1 specification states:
Because CSS 2.1 is more based on real-world implementations and lessons learned from CSS 2. It is the better spec (especially as far as clarity of explanations is concerned), and the one most browsers aim to implement.
~Grauw
The flaw in the reasoning in the post is, as pointed out above, that STRING doesn’t allow non-ASCII characters. In fact, it allows all characters except for unescaped newlines, the type of quotation marks quoting the string (unescaped), and control characters other than TAB (i.e., all characters less than ASCII/Unicode 32 other than TAB, plus ASCII/Unicode 127).
That said, it should be possible to figure this out without looking at the grammar. The attribute selector part certainly is possible: 5.8.1 says “Attribute values must be identifiers or strings.” The definitions of identifiers and strings, though, should probably be clearer in chapter 4.
The -~ bit pointed out by Antone seems to have been the missing key in terms of character-range. But why, then, offer a choice of a STRING or IDENT, when it seems like a simple STRING would suffice? I still don’t get that part. And then David said:
Could you maybe translate that into English, chief? Start with explaining how STRING can disallow non-ASCII characters while also allowing all characters besides the exceptions you listed. Those seem very much like mutually exclusive statements.
Oops, I was trying to write a double negative and I forgot one of them. I meant “doesn’t disallow”.
And the reason simple STRING doesn’t suffice is that we want to allow td[align=left] in addition to td[align="left"], as long as the attribute value is reasonably simple (i.e., an identifier).
Er, I did write a double negative, just not in the right two places. I meant “doesn’t disallow ASCII characters” (well, most of them). But I put a “non-” in front of “ASCII” instead of a “dis” in front of “allow”.
I should always read what I write to see any words out!
String vs. ident: The choice between string and ident allows you to skip the quotes if it is a single, simple word (no space, punctuation or non-ascii). So class=abc and class="abc" are both legal, while class="to be" must be quoted. The historical reason is probably that html uses these rules. (Unlike xml/xhtml, where quotes/strings are always mandatory.)
Strings do allow all characters, both ascii and non-ascii, with the newline/control char/quote exceptions.
“string doesn’t allow non-ascii characters” refers to “the flaw in the reasoning”, not to strings.
By the way, I don’t think “[ '"'? IDENT '"'? ]” means what you think it means, Eric. I think it means that EACH quote is optional, not BOTH quotes TAKEN TOGETHER. The subtle difference is illustrated here; with your syntax all of the following 4 are legal:
"hello" (quotes on both ends, which you thought of)
"hello (quote on left only)
hello" (quote on right only)
hello (no quotes at all, which you thought of)
I think what you were going for is
[ '"' IDENT '"' | IDENT ]
which indicates a choice: quotes-on or quotes-off, you supply both quotes or no quotes. No onesies.
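The difference between the two formulations is easy to see with regular expressions. This is a small Python sketch, using a simplified ASCII-only stand-in for IDENT; the names each_optional, both_or_none, and accepted are made up for illustration.

```python
import re

IDENT = r"[_a-zA-Z][_a-zA-Z0-9-]*"  # simplified stand-in for the real IDENT token

each_optional = re.compile(rf'"?{IDENT}"?')       # '"'? IDENT '"'?
both_or_none = re.compile(rf'"{IDENT}"|{IDENT}')  # '"' IDENT '"' | IDENT

candidates = ['"hello"', '"hello', 'hello"', 'hello']


def accepted(pattern, values):
    # Keep only the values that the pattern matches in full.
    return [v for v in values if pattern.fullmatch(v)]


print(accepted(each_optional, candidates))  # ['"hello"', '"hello', 'hello"', 'hello']
print(accepted(both_or_none, candidates))   # ['"hello"', 'hello']
```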
Finally, telling people to “Remember to encode character entities if you’re posting markup examples!” without a preview button is rather “hostile-to-outsiders”, no?
Okay, it’s much more clear now—thanks to all.
J.B.: yep, you’re right, that’s more in line with what I meant to say, although I realized later that my formulation didn’t permit single-quote marks, and in turn neither did yours, but you were a lot closer to being correct than I was. As for the outsider-hostile character encoding, I figure that anyone who needs to encode markup will know how to do it; nobody else would really need to know anyway. One of these days I’ll probably add a preview function, but there are a lot of other things sitting much higher on my priority list. With any luck, WordPress will get a comment-preview function built in, and that will be that.
Is the title of a play underlined or in quotes? Thank you for your assistance.