How to Read in String With Spaces Regex Expression in C
Start of String and End of String Anchors
So far, we have learned about literal characters, character classes, and the dot. Putting one of these in a regex tells the regex engine to try to match a single character.
Anchors are a different breed. They do not match any character at all. Instead, they match a position before, after, or between characters. They can be used to "anchor" the regex match at a certain position. The caret ^ matches the position before the first character in the string. Applying ^a to abc matches a. ^b does not match abc at all, because the b cannot be matched right after the start of the string, matched by ^. See below for the inside view of the regex engine.
Similarly, $ matches right after the last character in the string. c$ matches c in abc, while a$ does not match at all.
A regex that consists solely of an anchor can only find zero-length matches. This can be useful, but can also create complications that are explained near the end of this tutorial.
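As a minimal sketch, the matches described above can be reproduced with Python's re module. Python is used here purely for illustration; it is an assumption of this example, not something the text prescribes.

```python
import re

# ^a matches the "a" at the start of "abc".
start_match = re.search(r"^a", "abc")

# ^b cannot match: the "b" does not sit right after the start of the string.
no_start_match = re.search(r"^b", "abc")

# c$ matches the "c" at the end of "abc"; a$ does not match at all.
end_match = re.search(r"c$", "abc")
no_end_match = re.search(r"a$", "abc")

# A regex made of an anchor alone finds a zero-length match.
anchor_only = re.search(r"^", "abc")

print(start_match.group())   # a
print(no_start_match)        # None
print(end_match.group())     # c
print(no_end_match)          # None
print(anchor_only.span())    # (0, 0)
```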
Useful Applications
When using regular expressions in a programming language to validate user input, using anchors is very important. If you use the code if ($input =~ m/\d+/) in a Perl script to see if the user entered an integer number, it will accept the input even if the user entered qsdf4ghjk, because \d+ matches the 4. The correct regex to use is ^\d+$. Because "start of string" must be matched before the match of \d+, and "end of string" must be matched right after it, the entire string must consist of digits for ^\d+$ to be able to match.
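The difference between the unanchored and the anchored check can be sketched in Python as follows. The function names contains_digits and is_all_digits are hypothetical, chosen only for this illustration.

```python
import re

def contains_digits(text):
    """Unanchored check: True if digits occur anywhere in the text."""
    return re.search(r"\d+", text) is not None

def is_all_digits(text):
    """Anchored check: digits must span from the start to the end of the text."""
    return re.search(r"^\d+$", text) is not None

print(contains_digits("qsdf4ghjk"))  # True  (the 4 is enough)
print(is_all_digits("qsdf4ghjk"))    # False
print(is_all_digits("12345"))        # True
```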
It is easy for the user to accidentally type in a space. When Perl reads a line from a text file, the line break is also stored in the variable. So before validating input, it is good practice to trim leading and trailing whitespace. ^\s+ matches leading whitespace and \s+$ matches trailing whitespace. In Perl, you could use $input =~ s/^\s+|\s+$//g. Handy use of alternation and /g allows us to do this in a single line of code.
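A rough Python equivalent of the Perl one-liner is sketched below. The helper name trim is hypothetical. re.sub replaces every match by default, which plays the role of /g here.

```python
import re

def trim(text):
    """Remove leading and trailing whitespace using the two anchored alternatives."""
    # ^\s+ strips the leading run, \s+$ strips the trailing run.
    return re.sub(r"^\s+|\s+$", "", text)

print(repr(trim("  42\n")))    # '42'
print(repr(trim("\t a b  ")))  # 'a b'
```

In everyday Python code str.strip() does the same job without a regex; the regex form is shown only because it mirrors the expression in the text.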
Using ^ and $ as Start of Line and End of Line Anchors
If you have a string consisting of multiple lines, like first line\nsecond line (where \n indicates a line break), it is often desirable to work with lines, rather than the entire string. Therefore, most regex engines discussed in this tutorial have the option to expand the meaning of both anchors. ^ can then match at the start of the string (before the f in the above string), as well as after each line break (between \n and s). Likewise, $ still matches at the end of the string (after the last e), and also before every line break (between e and \n).
In text editors like EditPad Pro or GNU Emacs, and regex tools like PowerGREP, the caret and dollar always match at the start and end of each line. This makes sense because those applications are designed to work with entire files, rather than short strings. In Ruby and std::regex the caret and dollar also always match at the start and end of each line. In Boost they match at the start and end of each line by default. Boost allows you to turn this off with regex_constants::no_mod_m when using the ECMAScript grammar.
In all other programming languages and libraries discussed on this website, you have to explicitly activate this extended functionality. It is traditionally called "multi-line mode". In Perl, you do this by adding an m after the regex code, like this: m/^regex$/m;. In .NET, the anchors match before and after newlines when you specify RegexOptions.Multiline, such as in Regex.Match("string", "regex", RegexOptions.Multiline).
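Python is one of the languages where multi-line mode must be switched on explicitly, via the re.MULTILINE flag. The sketch below lists the positions where ^ and $ match in the two-line example string, with and without that flag.

```python
import re

text = "first line\nsecond line"

# Without multi-line mode, ^ matches only at the start of the string.
default_starts = [m.start() for m in re.finditer(r"^", text)]

# With re.MULTILINE, ^ also matches right after each line break.
multiline_starts = [m.start() for m in re.finditer(r"^", text, re.MULTILINE)]

# With re.MULTILINE, $ matches before each line break and at the end.
multiline_ends = [m.start() for m in re.finditer(r"$", text, re.MULTILINE)]

print(default_starts)    # [0]
print(multiline_starts)  # [0, 11]
print(multiline_ends)    # [10, 22]
```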
Line Break Characters
The tutorial page about the dot already discussed which characters are seen as line break characters by the various regex flavors. This affects the anchors just as much when in multi-line mode, and when the dollar matches before the end of the final break. The anchors handle line breaks that consist of a single character the same way as the dot in each regex flavor.
For anchors there's an additional consideration when CR and LF occur as a pair and the regex flavor treats both these characters as line breaks. Delphi, Java, and the JGsoft flavor treat CRLF as an indivisible pair. ^ matches after CRLF and $ matches before CRLF, but neither matches in the middle of a CRLF pair. JavaScript and XPath treat CRLF pairs as two line breaks. ^ matches in the middle of and after CRLF, while $ matches before and in the middle of CRLF.
Permanent Start of String and End of String Anchors
\A only ever matches at the start of the string. Likewise, \Z only ever matches at the end of the string. These two tokens never match at line breaks. This is true in all regex flavors discussed in this tutorial, even when you turn on "multiline mode". In EditPad Pro and PowerGREP, where the caret and dollar always match at the start and end of lines, \A and \Z only match at the start and the end of the entire file.
JavaScript, POSIX, XML, and XPath do not support \A and \Z. You're stuck with using the caret and dollar for this purpose.
The GNU extensions to POSIX regular expressions use \` (backtick) to match the start of the string, and \' (single quote) to match the end of the string.
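To illustrate that \A and \Z are unaffected by multi-line mode, the following sketch compares them with ^ and $ in Python's re module. The sample string is made up for this example.

```python
import re

text = "one\ntwo"

# In multi-line mode, ^ and $ match at every line.
caret_words = re.findall(r"^\w+", text, re.MULTILINE)
dollar_words = re.findall(r"\w+$", text, re.MULTILINE)

# \A and \Z still match only at the start and end of the whole string.
a_words = re.findall(r"\A\w+", text, re.MULTILINE)
z_words = re.findall(r"\w+\Z", text, re.MULTILINE)

print(caret_words)   # ['one', 'two']
print(dollar_words)  # ['one', 'two']
print(a_words)       # ['one']
print(z_words)       # ['two']
```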
Strings Ending with a Line Break
Because Perl returns a string with a newline at the end when reading a line from a file, Perl's regex engine matches $ at the position before the line break at the end of the string even when multi-line mode is turned off. Perl also matches $ at the very end of the string, regardless of whether that character is a line break. So ^\d+$ matches 123 whether the subject string is 123 or 123\n.
Most modern regex flavors have copied this behavior. That includes .NET, Java, PCRE, Delphi, PHP, and Python. This behavior is independent of any settings such as "multi-line mode".
In all these flavors except Python, \Z also matches before the final line break. If you only want a match at the absolute very end of the string, use \z (lowercase z instead of uppercase Z). \A\d+\z does not match 123\n. \z matches after the line break, which is not matched by the shorthand character class.
In Python, \Z matches only at the very end of the string. Python does not support \z.
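The Python behavior described above can be checked with a short sketch: $ tolerates one trailing line break, while \Z does not.

```python
import re

# $ matches at the very end and also before a final line break.
dollar_plain = re.search(r"^\d+$", "123")
dollar_newline = re.search(r"^\d+$", "123\n")

# In Python, \Z matches only at the very end of the string.
z_plain = re.search(r"\A\d+\Z", "123")
z_newline = re.search(r"\A\d+\Z", "123\n")

print(dollar_plain.group())    # 123
print(dollar_newline.group())  # 123
print(z_plain.group())         # 123
print(z_newline)               # None
```

For strict validation in Python this suggests anchoring with \A and \Z rather than ^ and $, so that a stray trailing newline is rejected.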
Strings Ending with Multiple Line Breaks
If a string ends with multiple line breaks and multi-line mode is off, then $ only matches before the last of those line breaks in all flavors where it can match before the final break. The same is true for \Z regardless of multi-line mode.
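A small Python sketch, using a made-up subject string that ends in two line breaks, shows where $ can match when multi-line mode is off.

```python
import re

text = "123\n\n"  # digits followed by two line breaks

# Positions where $ matches without multi-line mode:
# before the last line break (index 4) and at the very end (index 5).
dollar_positions = [m.start() for m in re.finditer(r"$", text)]

# $ cannot match before the first of the two line breaks,
# so the digits are never directly followed by a $ position.
digits_then_dollar = re.search(r"\d+$", text)

print(dollar_positions)    # [4, 5]
print(digits_then_dollar)  # None
```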
Boost is the only exception. In Boost, \Z can match before any number of trailing line breaks as well as at the very end of the string. So if the subject string ends with three line breaks, Boost's \Z has four positions that it can match at. As in all other flavors, Boost's \Z is independent of multi-line mode. Boost's $ only matches at the very end of the string when you turn off multi-line mode (which is on by default in Boost).
Looking Inside The Regex Engine
Let's see what happens when we try to match ^4$ to 749\n486\n4 (where \n represents a newline character) in multi-line mode. As usual, the regex engine starts at the first character: 7. The first token in the regular expression is ^. Since this token is a zero-length token, the engine does not try to match it with the character, but rather with the position before the character that the regex engine has reached so far. ^ indeed matches the position before 7. The engine then advances to the next regex token: 4. Since the previous token was zero-length, the regex engine does not advance to the next character in the string. It remains at 7. 4 is a literal character, which does not match 7. There are no other permutations of the regex, so the engine starts again with the first regex token, at the next character: 4. This time, ^ cannot match at the position before the 4. This position is preceded by a character, and that character is not a newline. The engine continues at 9, and fails again. The next attempt, at \n, also fails. Again, the position before \n is preceded by a character, 9, and that character is not a newline.
Then, the regex engine arrives at the second 4 in the string. The ^ can match at the position before the 4, because it is preceded by a newline character. Again, the regex engine advances to the next regex token, 4, but does not advance the character position in the string. 4 matches 4, and the engine advances both the regex token and the string character. Now the engine attempts to match $ at the position before (indeed: before) the 8. The dollar cannot match here, because this position is followed by a character, and that character is not a newline.
Yet again, the engine must try to match the first token. Previously, it was successfully matched at the second 4, so the engine continues at the next character, 8, where the caret does not match. Same at the 6 and the newline.
Finally, the regex engine tries to match the first token at the third 4 in the string. With success. After that, the engine successfully matches 4 with 4. The current regex token is advanced to $, and the current character is advanced to the very last position in the string: the void after the string. No regex token that needs a character to match can match here. Not even a negated character class. However, we are trying to match a dollar sign, and the mighty dollar is a strange beast. It is zero-length, so it tries to match the position before the current character. It does not matter that this "character" is the void after the string. In fact, the dollar checks the current character. It must be either a newline, or the void after the string, for $ to match the position before the current character. Since that is the case after the example, the dollar matches successfully.
Since $ was the last token in the regex, the engine has found a successful match: the last 4 in the string.
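The outcome of this walkthrough can be confirmed with a short Python sketch. It only reports the positions where ^ can match and the final result; it does not trace the engine's internal steps.

```python
import re

subject = "749\n486\n4"

# Positions where ^ matches in multi-line mode: before 7, before the
# second 4 (after the first newline), and before the third 4.
caret_positions = [m.start() for m in re.finditer(r"^", subject, re.MULTILINE)]

# The full regex only succeeds at the last line, which consists of 4 alone.
match = re.search(r"^4$", subject, re.MULTILINE)

print(caret_positions)  # [0, 4, 8]
print(match.span())     # (8, 9)
print(match.group())    # 4
```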
Source: https://www.regular-expressions.info/anchors.html