More regular expressions in Kile: greedy and non-greedy search

First let me point out a common misconception about the wild card characters “?” and “*”. As we correctly remember, “?” means zero or one of some object, and “*” means zero or more of some object. However, they are not stand-alone universal objects, as they are in shell programming. These two wild cards must be preceded by another character or group of characters, as in the following example. Sample text:

Qsignal3-5.pdf                                100%   30KB  30.2KB/s   00:00
Qsignal3-12.pdf                               100%   25KB  25.4KB/s   00:00
TVsignal3-2.pdf                               100%   31KB  30.5KB/s   00:00
TVsignal3-4.pdf                               100%   27KB  27.3KB/s   00:00
TVsignal3-6.pdf                               100%   17KB  17.2KB/s   00:00

Now say I want to enclose each *.pdf in the tag \includegraphics{..}, to get

\includegraphics{Qsignal3-5.pdf}
\includegraphics{Qsignal3-12.pdf}
\includegraphics{TVsignal3-2.pdf}
\includegraphics{TVsignal3-4.pdf}
\includegraphics{TVsignal3-6.pdf}

One way is to

search for: pdf.*100%.*\n

and replace by: pdf\} \n \\includegraphics\{

A few technical points: the double backslash is necessary because the backslash is a special symbol, unlike the percent sign %; the braces ‘{’, ‘}’ and their siblings are special too, which is why they are escaped with a backslash.

To understand these two regular expressions: the search query looks for a string starting with “pdf”, followed by an arbitrary number of non-newline characters, followed by the string “100%”, followed again by an arbitrary number of non-newline characters, and finally followed by the newline symbol ‘\n’, which marks the line break. Notice that “.” stands for any character that is not a newline. This is why the search won’t return a single match of the form

pdf                                100%   30KB  30.2KB/s   00:00
Qsignal3-12.pdf                               100%   25KB  25.4KB/s   00:00
TVsignal3-2.pdf                               100%   31KB  30.5KB/s   00:00
TVsignal3-4.pdf                               100%   27KB  27.3KB/s   00:00
TVsignal3-6.pdf                               100%   17KB  17.2KB/s   00:00

If “.” did match ‘\n’, then by the default greedy behavior the search would return exactly that single match, running from the first “pdf” to the very last line break.
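The difference can be checked outside Kile. Below is a minimal sketch in Python's re module, under the assumption that its handling of “.” and of greedy “.*” agrees with Kile's on this example; the variable names are made up for illustration, and the column spacing of the sample is shortened.

```python
import re

# Sample scp output from the post, one transfer per line (spacing shortened).
text = (
    "Qsignal3-5.pdf    100%   30KB  30.2KB/s   00:00\n"
    "Qsignal3-12.pdf   100%   25KB  25.4KB/s   00:00\n"
    "TVsignal3-2.pdf   100%   31KB  30.5KB/s   00:00\n"
    "TVsignal3-4.pdf   100%   27KB  27.3KB/s   00:00\n"
    "TVsignal3-6.pdf   100%   17KB  17.2KB/s   00:00\n"
)

pattern = r"pdf.*100%.*\n"

# Default behavior: "." does not match "\n", so every line gives its own match.
per_line = re.findall(pattern, text)

# With DOTALL, "." also matches "\n"; the greedy ".*" then swallows everything
# from the first "pdf" up to the last line break in one single match.
whole = re.findall(pattern, text, flags=re.DOTALL)

print(len(per_line))  # 5
print(len(whole))     # 1
```

The DOTALL flag is only there to simulate a “.” that includes the line break; the post's search in Kile corresponds to the first call.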

The result one gets from this search and replace is

Qsignal3-5.pdf}
\includegraphics{Qsignal3-12.pdf}
\includegraphics{TVsignal3-2.pdf}
\includegraphics{TVsignal3-4.pdf}
\includegraphics{TVsignal3-6.pdf}
\includegraphics{

So one still has to fix the first and last lines by hand. A smarter way is to search for strings that start with ‘Q’ or ‘T’ and end with “pdf”. This is accomplished by

search: ([QT].*pdf).*\n

replace with: \\includegraphics\{\1\}\n

Here “[QT]” stands for a single character which is either ‘Q’ or ‘T’. The “\1” refers to the first substring captured in parentheses “( )” in the search query and inserts it unchanged into the replacement, as covered in the previous lecture on regex.
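For comparison, here is the same capture-group replacement as a sketch in Python's re module. Python is an assumption here, not what Kile runs, and its replacement syntax differs slightly: the backslash still has to be doubled, but the braces need no escaping.

```python
import re

# Sample scp output from the post, one transfer per line (spacing shortened).
text = (
    "Qsignal3-5.pdf    100%   30KB  30.2KB/s   00:00\n"
    "Qsignal3-12.pdf   100%   25KB  25.4KB/s   00:00\n"
    "TVsignal3-2.pdf   100%   31KB  30.5KB/s   00:00\n"
    "TVsignal3-4.pdf   100%   27KB  27.3KB/s   00:00\n"
    "TVsignal3-6.pdf   100%   17KB  17.2KB/s   00:00\n"
)

# ([QT].*pdf) captures the file name; the trailing .*\n eats the rest of the line.
# In the replacement, "\\" yields a literal backslash and "\1" the captured name.
result = re.sub(r"([QT].*pdf).*\n", r"\\includegraphics{\1}\n", text)

print(result)
```

Every line is now rewritten in one pass, so there is no leftover first or last instance to repair.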

Finally we want to discuss how to override the greedy behavior of Kile’s regular expression replacement utility. The usual way is to use the question mark ‘?’, as in the following example.

Say I have a latex snippet:

{dollar sign}\alpha_c{dollar sign} {dollar sign}\beta^t{dollar sign}

And I want to replace all {dollar sign}..{dollar sign} groups with {dollar sign}latex ..{dollar sign}.

If you search for {dollar sign}.*{dollar sign}, it will return a single match consisting of the whole text

{dollar sign}\alpha_c{dollar sign} {dollar sign}\beta^t{dollar sign}

because it always looks for the longest matching substring.

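Now in most regex dialects you can override this with the question mark trick: a ‘?’ placed directly after a quantifier makes it non-greedy, so that it matches as few characters as possible. Note that this is an additional rule rather than a consequence of ‘?’ meaning zero or one; writing (.*)? would only make the group optional and leave the star greedy.

search for: {dollar sign}(.*?){dollar sign} and replace by: {dollar sign}latex \1{dollar sign}.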

Unfortunately, Kile doesn’t seem to support this use of ‘?’. So instead we use another trick: we ask for the longest substring after the first ‘{dollar sign}’ that consists only of characters that are not ‘{dollar sign}’, that is, we search for {dollar sign}([^{dollar sign}]*){dollar sign} instead

and replace by {dollar sign}latex \1{dollar sign} as before.
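The greedy search, the question mark variants and the negated character class can be compared side by side. This is a sketch in Python's re module, which does support the non-greedy ‘?’; it is an illustration under that assumption and says nothing about what Kile itself accepts. The code uses the actual dollar character where the post writes {dollar sign}, and the variable names are made up.

```python
import re

snippet = r"$\alpha_c$ $\beta^t$"

# Greedy: ".*" runs to the last dollar sign, so the whole text is one match.
greedy = re.findall(r"\$.*\$", snippet)

# "?" after the group only makes the group optional; the star stays greedy,
# so only the outermost pair of dollar signs is rewritten.
optional_group = re.sub(r"\$(.*)?\$", r"$latex \1$", snippet)

# "?" directly after the star makes it non-greedy: shortest match per group.
lazy = re.sub(r"\$(.*?)\$", r"$latex \1$", snippet)

# Negated class: the longest run of non-dollar characters between two dollars.
negated = re.sub(r"\$([^$]*)\$", r"$latex \1$", snippet)

print(greedy)
print(optional_group)
print(lazy)
print(negated)
```

The last two calls give the same output on this snippet. The negated class has the advantage of working in any dialect, since it relies on nothing beyond character classes and the ordinary greedy star.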

This is really the single most important use of regular expressions in Kile for me. Before, I had to rely on an external perl program called wordpresslatex or something to convert ordinary LaTeX to WordPress format; now I have it at my fingertips. Also, I no longer get the unusual formatting that the perl program introduces!

p.s.: I have to write {dollar sign} to denote the US dollar sign because WordPress reserves a special meaning for the actual symbol. Also, in an actual search the dollar sign must be backslash-escaped, since it is a special symbol in regular expressions.

About aquazorcarson

math PhD at Stanford, studying probability
