rebol document

Chapter 15 - Parsing

REBOL/Core Users Guide
Main Table of Contents
Send Us Feedback

Contents:

1. Overview
2. Simple Splitting
3. Grammar Rules
4. Skipping Input
5. Match Types
6. Recursive Rules
7. Evaluation
7.1 Return Value
7.2 Expressions in Rules
7.3 Copying the Input
7.4 Marking the Input
7.5 Modifying the String
7.6 Using Objects
7.7 Debugging
8. Dealing with Spaces
9. Parsing Blocks and Dialects
9.1 Matching Words
9.2 Matching Datatypes
9.3 Characters Not Allowed
9.4 Dialect Examples
9.5 Parsing Sub-blocks
10. Summary of Parse Operations
10.1 General Forms
10.2 Specifying Quantity
10.3 Skipping Values
10.4 Getting Values
10.5 Using Words
10.6 Value Matches (block parsing only)
10.7 Datatype Words

1. Overview

Parsing splits a sequence of characters or values into smaller parts. It can be used for recognizing characters or values that occur in a specific order. In addition to providing a powerful, readable, and maintainable approach to regular expression pattern matching, parsing enables you to create your own custom languages for specific purposes.

The parse function has the general form:

parse series rules

The series argument is the input that is parsed and can be a string or a block. If the argument is a string, it is parsed by character. If the argument is a block, it is parsed by value.

The rules argument specifies how the series argument is parsed. The rules argument can be a string for simple types of parsing or a block for sophisticated parsing.

The parse function also accepts two refinements: /all and /case. The /all refinement parses all the characters within a string, including all delimiters, such as space, tab, newline, comma, and semicolon. The /case refinement parses a string based on case. When /case is not specified, upper and lower cases are treated the same.

2. Simple Splitting

A simple form of parse is for splitting strings:

parse string none

The parse function splits the input argument, string, into a block of multiple strings, breaking each string wherever it encounters a delimiter, such as a space, tab, newline, comma, or semicolon. The none argument indicates that no other delimiters other than these. For example:

probe parse "The trip will take 21 days" none
["The" "trip" "will" "take" "21" "days"]

Similarly,

probe parse "here there,everywhere; ok" none
["here" "there" "everywhere" "ok"]

In the example above, notice that the commas and semicolons have been removed from the resulting strings.

You can specify your own delimiters in the second argument to parse. For example, the following code parses a telephone number with dash (-) delimiters:

probe parse "707-467-8000" "-"
["707" "467" "8000"]

The next example uses equal (=) and double quote (") as the delimiters:

probe parse <IMG SRC="test.gif" WIDTH="123"> {="}
["IMG" "SRC" "test.gif" "WIDTH" "123"]

The next example parses a string based on commas only; any other delimiters are ignored. Consequently, the spaces within the strings are not removed:

Normally when you parse strings, any whitespace (space, tab, lines) are automatically processed as delimiters. To avoid that action, you can use the /any refinement. Compare these two examples:

parse "Test This" ""
["Test" "This"]
parse/all "Test This" ""
["Test This"]

In the second, you can see that the space was not treated as a delimiter.

Here is another example:

probe parse/all "Harry, 1011 Main St., Ukiah" ","
["Harry" " 1011 Main St." " Ukiah"]

You can also parse strings that contain null characters as separators (such as certain types of data files):

parse/all nulled-string "^(null)"

3. Grammar Rules

The parse function accepts grammar rules that are written in a dialect of REBOL. Dialects are sub-languages of REBOL that use the same lexical form for all datatypes, but allow a different ordering of the values within a block. Within this dialect the grammar and vocabulary of REBOL is altered to make it similar in structure to the well known BNF (Backus-Naur Form) which is commonly used to specify language grammars, network protocols, header formats, etc.

To define rules, use a block to specify the sequence of the inputs. For instance, if you want to parse a string and return the characters "the phone", you can use a rule:

parse string ["the phone"]

To allow any number of spaces or no spaces between the words, write the rule like this:

parse string ["the" "phone"]

You can indicate alternate rules with a vertical bar (|). For example:

["the" "phone" | "a" "radio"]

accepts strings that match any of the following:

the phone
a radio

A rule can contain blocks that are treated as sub-rules. The following line:

[ ["a" | "the"] ["phone" | "radio"] ]

accepts strings that match any of the following:

a phone
a radio
the phone
the radio

For increased readability, write the sub-rules as a separate block and give them a name to help indicate their purpose:

article: ["a" | "the"]
device: ["phone" | "radio"]
parse string [article device]

In addition to matching a single instance of a string, you can provide a count or a range that repeats the match. The following example provides a count:

[3 "a" 2 "b"]

which accepts strings that match:

aaabb

The next example provides a range:

[1 3 "a" "b"]

which accepts strings that match any of the following:

ab aab aaab

The starting point of a range can be zero, meaning that it is optional.

[0 3 "a" "b"]

accepts strings that match any of the following:

b ab aab aaab

Use some to specify that one or more characters are matched. Use any to specify that zero or more characters are matched. For example, some used in the following line:

[some "a" "b"]

accepts strings that contain one or more characters a and b:

ab aab aaab aaaab

The next example uses any:

[any "a" "b"]

which accepts strings that contain zero or more characters a or b:

b ab aab aaab aaaab

The words some and any can also be used on blocks. For example:

[some ["a" | "b"]]

accepts strings that contain any combination of the characters a and b.

Another way to express that a character is optional is to provide an alternate choice of none:

["a" | "b" | none]

This example accepts strings that contain a or b or none.

The none is useful for specifying optional patterns or for catching error cases when no pattern matches.

4. Skipping Input

The skip, to, and thru words allow input to be skipped.

Use skip to skip a single character, or use it with a repeat to skip over multiple characters:

["a" skip "b"]
["a" 10 skip "b"]
["a" 1 10 skip]

To skip until a specific character is found, use to:

["a" to "b"]

The previous example starts parsing at a and ends at b but does not include b.

To include the ending character, use thru:

["a" thru "b"]

The previous example starts parsing at a, ends at b, and includes b.

The following rule finds the title of an HTML page and prints it:

page: read http://www.rebol.com/
parse page [thru <title> copy text to </title>]
print text
REBOL Technologies

The first thru finds the title tag and goes immediately past it. Next, the input string is copied into a variable called text until the ending tag is found (but it doesn't go past it, or the text would include the tag).

5. Match Types

When parsing strings, these datatypes and words can be used to match characters in the input string:

Match Type

Description

"abc"

match the entire string

#"c"

match a single character

tag

match a tag string

end

match to the end of the input

(bitset)

match any specified char in the set

To use all of these words (except bitset, which is explained below) in a single rule, use:

[<B> ["excellent" | "incredible"] #"!" </B> end]

This example parses the input strings:

<B>excellent!</B>
<B>incredible!</B>

The end specifies that nothing follows in the input stream. The entire input has been parsed. It is optional depending on whether the parse function's return value is to be checked. Refer to the Evaluation section below for more information.

The bitset datatype deserves more explanation. Bitsets are used to specify collections of characters in an efficient manner. The charset function enables you to specify individual characters or ranges of characters. For example, the line:

digit: charset "0123456789"

defines a character set that contains digits. This allows rules like:

[3 digit "-" 3 digit "-" 4 digit]

which can parse phone numbers of the form:

707-467-8000

To accept any number of digits, it is common to write the rule:

digits: [some digit]

A character set can also specify ranges of characters. For instance, the digit character set could have be written as:

digit: charset [#"0" - #"9"]

Alternatively, you can combine specific characters and ranges of characters:

the-set: charset ["+-." #"0" - #"9"]

To expand on this, here is the alphanumeric set of characters:

alphanum: charset [#"0" - #"9" #"A" - #"Z" #"a" - #"z"]

Character sets can also be modified with the insert and remove functions, or combinations of sets can be created with the union and intersect functions. This line copies the digit character set and adds a dot to it:

digit-dot: insert copy digit "."

The following lines define useful character sets for parsing:

digit: charset [#"0" - #"9"]
alpha: charset [#"A" - #"Z" #"a" - #"z"]
alphanum: union alpha digit

6. Recursive Rules

Here is an example of rule set that parses mathematical expressions and gives a precedence (a priority) to the math operators used:

expr:    [term ["+" | "-"] expr | term]
term:    [factor ["*" | "/"] term | factor]
factor:  [primary "**" factor | primary]
primary: [some digit | "(" expr ")"]
digit:   charset "0123456789"

Now we can parse many types of math expressions. The following examples return true, indicating that the expressions were valid:

probe parse "1 + 2 * ( 3 - 2 ) / 4" expr
true
probe parse "4/5+3**2-(5*6+1)" expr
true

Notice in the examples that some of the rules refer to themselves. For instance, the expr rule includes expr. This is a useful technique for defining repeating sequences and combinations. The rule is recursive --it refers to itself.

When using recursive rules, care is required to prevent endless recursion. For instance:

expr: [expr ["+" | "-"] term]

creates an infinite loop because the first thing expr does is use expr again.

7. Evaluation

Normally, you parse a string to produce some result. You want to do more than just verify that the string is valid, you want to do something as it is parsed. For instance, you may want to pick out substrings from various parts of the string, create blocks of related values, or compute a value.

7.1 Return Value

The examples in previous chapters showed how to parse strings, but no results were produced. This is only done to verify that a string has the specified grammar; the value returned from parse indicates its success. The following examples show this:

probe parse "a b c" ["a" "b" "c"]
true
probe parse "a b" ["a" "c"]
false

The parse function returns true only if it reaches the end of the input string. An unsuccessful match stops the parse of the series. If parse runs out of values to search for before reaching the end of the series, it does not traverse the series and returns false :

probe parse "a b c d" ["a" "b" "c"]
false
probe parse "a b c d" [to "b" thru "d"]
true
probe parse "a b c d" [to "b" to end]
true

7.2 Expressions in Rules

Within a rule, you can include a REBOL expression to be evaluated when parse reaches that point in the rule. Parentheses are used to indicate such expressions:

string: "there is a phone in this sentence"
probe parse string [
    to "a"
    to "phone" (print "found phone")
    to end
]
found phone
true

The example above parses the string a phone and prints the message found phone after the match is complete. If the strings a or phone are missing and the parse can not be done, the expression is not evaluated.

Expressions can appear anywhere within a rule, and multiple expressions can occur in different parts of a rule. For instance, the following code prints different strings depending on what inputs were found:

parse string [
    "a" | "the"
    to "phone" (print "answer") |
    to "radio" (print "listen") |
    to "tv"    (print "watch")
]
answer
string: "there is the radio on the shelf"

parse string [
    "a" | "the"
    to "phone" (print "answer") |
    to "radio" (print "listen") |
    to "tv"    (print "watch")
]
listen

Here is an example that counts the number of times the HTML pre-format tag appears in a text string:

count: 0
page: read http://www.rebol.com/docs/dictionary.html
parse page [any [thru <pre> (count: count + 1)]]
print count
777

7.3 Copying the Input

The most common action done with parse is to pick up parts of the string being parsed. This is done with copy, and it is followed by the name of a variable to which you want to copy the string. The following example parses the title of a web page:

parse page [thru <title> copy text to </title>]
print text
REBOL/Core Dictionary

The example works by skipping over text until it finds the <title> tag. That's where it starts making a copy of the input stream and setting a variable called text to hold it. The copy operation continues until the closing <title> tag is found.

The copy action also can be used with entire rule blocks. For instance, for the rule:

[copy heading ["H" ["1" | "2" | "3"]]

the heading string contains the entire H1, H2, or H3 string. This also works for large multi-block rules.

7.4 Marking the Input

The copy action makes a copy of the substring that it finds, but that is not always desirable. In some cases, it is better to save the current position of the input stream in a variable.

NOTE: The copy word as used in parse is different from the copy function used in REBOL expressions. Parse uses a dialect of REBOL, and copy has a different meaning within that dialect.

In the following example, the begin variable holds a reference to the page input string just after <title>. The ending refers to the page string just before . These variables can be used in the same way as they would be used with any other series.

parse page [
    thru <title> begin: to </title> ending:
    (change/part begin "Word Reference Guide" ending)
]

You can see the above parse expression actually changed the contents of the title:

parse page [thru <title> copy text to </title>]
print text
Word Reference Guide

Here is another example that marks the position of every table tag in an HTML file:

page: read http://www.rebol.com/index.html
tables: make block! 20
parse page [
    any [to "<table" mark: thru ">"
        (append tables index? mark)
    ]
]

The tables block now contains the position of every tag:

foreach table tables [
    print ["table found at index:" table]
]
table found at index: 836
table found at index: 2076
table found at index: 3747
table found at index: 3815
table found at index: 4027
table found at index: 4415
table found at index: 6050
table found at index: 6556
table found at index: 7229
table found at index: 8268

NOTE: The current position in the input string can also be modified. The next section explains how this is done.

7.5 Modifying the String

Now that you know how to obtain the position of the input series, you also can use other series functions on it, including insert, remove, and change. To write a script that replaces all question marks (?) with exclamation marks (!), write:

str: "Where is the turkey? Have you seen the turkey?"
parse str [some [to "?" mark: (change mark "!") skip]]
print str
Where is the turkey! Have you seen the turkey!

The skip at the tail advances the input over the new character, which is not necessary in this case, but it is a good practice.

As another example, to insert the current time everywhere the word time appears in some text, write:

str: "at this time, I'd like to see the time change"
parse str [
    some [to "time"
        mark:
        (remove/part mark 4  mark: insert mark now/time)
        :mark
    ]
]
print str
at this 14:42:12, I'd like to see the 14:42:12 change

Notice the :mark word used above. It sets the input to a new position. The insert function returns the new position just past the insert of the current time. The word :mark is used to set the input to that position.

7.6 Using Objects

When parsing large grammar from a set of rules, variables are used to make the grammar more readable. However, the variables are global and may become confused with other variables that have the same name somewhere else in the program.

The solution to this problem is to use an object to make all the rule words local to a context. For instance:

tag-parser: make object! [
    tags: make block! 100
    text: make string! 8000
    html-code: [
        copy tag ["<" thru ">"] (append tags tag) |
        copy txt to "<" (append text txt)
    ]
    parse-tags: func [site [url!]] [
        clear tags clear text
        parse read site [to "<" some html-code]
        foreach tag tags [print tag]
        print text
    ]
]
tag-parser/parse-tags http://www.rebol.com

7.7 Debugging

As rules are written, there are times debugging is needed. Specifically, you may want to know how far you got in the parsing of a rule.

The trace function can be used to watch the parse operation progress, but this can output thousands of lines that are difficult to review.

A better way is to insert debugging expressions into the parse rules. As an example, to debug the rule:

[to "<IMG" "SRC" "=" filename ">"]

insert a the print function after key sections to monitor your progress through the rule:

[to "<IMG" (print 1) "SRC" "=" (print 2)
    filename (print 3) ">"]

This example prints 1, 2, and 3 as the rule is processed.

Another approach is to print out part of the input string as the parse happens:

[
   to "<IMG" here: (print here)
   "SRC" "=" here: (print here)
    filename here: (print here) ">"
]

If this is done often, you can create a rule for it:

here: [where: (print where)]

[
   to "<IMG" here
   "SRC" "=" here
    filename here ">"
]

The copy function can also be used to indicate what substrings were parsed as the rule was handled.

8. Dealing with Spaces

The parse function normally ignores all intervening whitespace between patterns that it scans. For instance, the rule:

["a" "b" "c"]

returns strings that match:

abc
a bc
ab c
a b c
a  b  c

and other similarly spaced combinations.

To enforce a specific spacing convention, use parse with the /all refinement. In the preceeding example, this refinement causes parse to only match the first case (abc).

parse/all "abc" ["a" "b" "c"]

Specifying the / all refinement forces every character in the input stream to be dealt with, including the default delimiters, such as space, tab, newline.

To handle spaces in your rules, create a character set that specifies the valid space characters:

spacer: charset reduce [tab newline #" "]

If you want a single space character between each letter write:

["a" spacer "b" spacer "c"]

To allow multiple space characters, write:

spaces: [some spacer]
["a" spaces "b" spaces "c"]

For more sophisticated grammars, create a character set that lets you scan a string up to a space character.

non-space: complement spacer
to-space: [some non-space | end]
words: make block! 20
parse/all text [
    some [copy word to-space (append words word) spacer]
]

The preceding example builds a block of all of its words. The complement function inverts the character set. Now it contains everything except the spacing characters you defined earlier. The non-space character set contains all characters except space characters. The to-space rule accepts one or more characters up to a space character or the end of the input stream. The main rule expects to begin with a word, copy that word up to a space, then skip the space character and begin the next word.

9. Parsing Blocks and Dialects

Blocks are parsed similar to strings. A set of rules specify the order of expected values. However, unlike the parsing of strings, the parsing of blocks is not concerned with characters or delimiters. Parsing of blocks is done at the value level, making the grammar rules easier to specify and operation many times faster.

Block parsing is the easiest way to create REBOL dialects. Dialects are sub-languages of REBOL that use the same lexical form for all data types but allow a different ordering of the values within a block. The values do not need to conform to the normal order required by REBOL function arguments. Dialects are able to provide greater expressive power for specific domains of use. For instance, the parser rules themselves are specified as a dialect.

9.1 Matching Words

When parsing a block, to match against a word specify the word as a literal:

'name
'when
'empty

9.2 Matching Datatypes

You can match a value of any datatype by specifying the data type word. See Datatype Matches below.

Datatype Word

Description

string!

matches any quoted string

time!

matches any time

date!

matches any date

tuple!

matches any tuple

NOTE: Don't forget the "!" that is part of the name or an error will be generated.

9.3 Characters Not Allowed

The parse operations allowed for blocks are those that deal with specific characters. For instance, a match cannot be specified to the first letter of a word or string, nor to spacing or newline characters.

9.4 Dialect Examples

A few concise examples help illustrate the parsing of blocks:

block: [when 10:30]
print parse block ['when 10:30]
print parse block ['when time!]
parse block ['when set time time! (print time)]

Notice that a specific word can be matched by using its literal word in the rule (as in the case of 'when ). A datatype can be specified rather than a value, as in the lines above containing time!. In addition, a variable can be set to a value with the set operation.

As with strings, alternate rules can be specified when parsing blocks:

rule: [some [
    'when set time time! |
    'where set place string! |
    'who set persons [word! | block!]
]]

These rules allow information to be entered in any order:

parse [
    who Fred
    where "Downtown Center"
    when 9:30
] rule
print [time place persons]

This example could have used variable assignment, but it illustrates how to provide alternate input ordering.

Here's another example that evaluates the results of the parse:

rule: [
    set count integer!
    set str string!
    (loop count [print str])
]
parse [3 "great job"] rule
parse [3 "hut" 1 "hike"] [some rule]

Finally, here is a more advanced example:

rule: [
    set action ['buy | 'sell]
    set number integer!
    'shares 'at
    set price money!
    (either action = 'sell [
            print ["income" price * number]
            total: total + (price * number)
        ][
            print ["cost" price * number]
            total: total - (price * number)
        ]
    )
]

total: 0
parse [sell 100 shares at $123.45] rule
print ["total:" total]

total: 0
parse [
    sell 300 shares at $89.08
    buy  100 shares at $120.45
    sell 400 shares at $270.89
] [some rule]
print ["total:" total]

It should be noted that this is one way how expressions that use the dialect concept first described in Chapter 4 can be evaluated.

9.5 Parsing Sub-blocks

When parsing a block, if a sub-block is found, it is treated as a single value that is of the block! datatype. However, to parse a sub-block, you must invoke the parser recursively on the sub-block. The into word provides this capability. It expects that the next value in the input block is a sub-block to be parsed. This is as if a block! datatype had been provided. If the next value is not a block! datatype, the match fails and into looks for alternates or exits the rule. If the next value is a block, the parser rule that follows the into word is used to begin parsing the sub-block. It is processed in the same way as a sub-rule.

rule: [date! into [string! time!]]
data: [10-Jan-2000 ["Ukiah" 10:30]]
print parse data rule

All of the normal parser operations can be applied to into.

rule: [
    set date date!
    set info into [string! time!]]
]
data: [10-Jan-2000 ["Ukiah" 10:30]]
print parse data rule

print info

rule: [date! copy items 2 into [string! time!]]
data: [10-Jan-2000 ["Ukiah" 10:30] ["Rome" 2:45]]
print parse data rule

probe items

10. Summary of Parse Operations

10.1 General Forms

Operator

Description

|

alternate rule

[block]

sub-rule

(paren)

evaluate a REBOL expression

10.2 Specifying Quantity

Operator

Description

none

match nothing

opt

zero or one time

some

one or more times

any

zero or more times

12

repeat pattern 12 times

1 12

repeat pattern 1 to 12 times

0 12

repeat pattern 0 to 12 times

10.3 Skipping Values

Operator

Description

skip

skip a value (or multiple if repeat given)

to

advance input to a value or datatype

thru

advance input thru a value or datatype

10.4 Getting Values

Operator

Description

set

set the next value to a variable

copy

copy the next match sequence to a variable

10.5 Using Words

Operator

Description

word

look-up value of a word

word:

mark the current input series position

:word

set the current input series position

'word

matches the word literally (parse block)

10.6 Value Matches (block parsing only)

Operator

Description

"fred"

matches the string "fred"

%data

matches the file name %data

10:30

matches the time 10:30

1.2.3

matches the tuple 1.2.3

10.7 Datatype Words

Word

Description

type!

matches anything of a given datatype


Updated 15-Apr-2005 - Copyright REBOL Technologies - Formatted with MakeDoc2
REBOL.com Documents Manual Dictionary Library Feedback