Question

I'm trying to filter thousands of files, looking for those which contain string constants with mixed case. Such strings can be embedded in whitespace, but may not contain whitespace themselves. So the following (containing UC chars) are matches:

"  AString "   // leading and trailing spaces together allowed
"AString "     // trailing spaces allowed
"  AString"    // leading spaces allowed
"newString03"  // numeric chars allowed
"!stringBIG?"  // non-alphanumeric chars allowed
"R"            // Single UC is a match

but these are not:

"A String" // not a match because it contains an embedded space
"Foo bar baz" // does not match due to multiple whitespace interruptions
"a_string" // not a match because there are no UC chars

I still want to match on lines which contain both patterns:

"ABigString", "a sentence fragment" // need to catch so I find the first case...

I want to use Perl regexps, preferably driven by the ack command-line tool. Obviously, \w and \W are not going to work. It seems that \S should match the non-space chars. I can't seem to figure out how to embed the requirement of "at least one upper-case character per string"...

ack --match '\"\s*\S+\s*\"'

is the closest I've gotten. I need to replace the \S+ with something that captures the "at least one upper-case (ascii) character (in any position of the non-whitespace string)" requirement.

This is straightforward to program in C/C++ (and yes, Perl, procedurally, without resorting to regexps), I'm just trying to figure out if there is a regular expression which can do the same job.

Solution

The following pattern passes all your tests:

qr/
  "      # leading double quote

  (?!    # filter out strings with internal spaces
     [^"]*   # zero or more non-quotes
     [^"\s]  # neither a quote nor whitespace
     \s+     # internal whitespace
     [^"\s]  # another non-quote, non-whitespace character
  )

  [^"]*  # zero or more non-quote characters
  [A-Z]  # at least one uppercase letter
  [^"]*  # followed by zero or more non-quotes
  "      # and finally the trailing quote
/x
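
To see which constant on a line triggered the match, the same pattern can be applied with /g in list context. This is only a minimal sketch: the helper name mixed_case_constants and the sample line are made up for illustration and are not part of the original answer.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The pattern from above, written without /x.
my $pattern = qr/"(?![^"]*[^"\s]\s+[^"\s])[^"]*[A-Z][^"]*"/;

# Hypothetical helper: return every quoted constant on a line
# that the pattern accepts, in order of appearance.
sub mixed_case_constants {
    my ($line) = @_;
    my @found = $line =~ /($pattern)/g;   # one capture per match
    return @found;
}

# Made-up sample line with two accepted constants and one rejected one.
my $line = q<foo("ABigString", "a sentence fragment", " Tail ");>;
print "$_\n" for mixed_case_constants($line);
```

For the sample line this should list "ABigString" and " Tail " and skip the sentence fragment, which is the behavior the question asks for on lines containing both kinds of string.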

Using this test program, saved as try, which uses the above pattern without /x and therefore without whitespace and comments, as input to ack-grep (as ack is called on Ubuntu)

#! /usr/bin/perl

my @tests = (
  [ q<"  AString ">   => 1 ],
  [ q<"AString ">     => 1 ],
  [ q<"  AString">    => 1 ],
  [ q<"newString03">  => 1 ],
  [ q<"!stringBIG?">  => 1 ],
  [ q<"R">            => 1 ],
  [ q<"A String">     => 0 ],
  [ q<"a_string">     => 0 ],
  [ q<"ABigString", "a sentence fragment"> => 1 ],
  [ q<"  a String  "> => 0 ],
  [ q<"Foo bar baz">  => 0 ],
);

my $pattern = qr/"(?![^"]*[^"\s]\s+[^"\s])[^"]*[A-Z][^"]*"/;
for (@tests) {
  my($str,$expectMatch) = @$_;
  my $matched = $str =~ /$pattern/;
  print +($matched xor $expectMatch) ? "FAIL" : "PASS",
        ": $str\n";
}

produces the following output:

$ ack-grep '"(?![^"]*[^"\s]\s+[^"\s])[^"]*[A-Z][^"]*"' try
  [ q<"  AString ">   => 1 ],
  [ q<"AString ">     => 1 ],
  [ q<"  AString">    => 1 ],
  [ q<"newString03">  => 1 ],
  [ q<"!stringBIG?">  => 1 ],
  [ q<"R">            => 1 ],
  [ q<"ABigString", "a sentence fragment"> => 1 ],
my $pattern = qr/"(?![^"]*[^"\s]\s+[^"\s])[^"]*[A-Z][^"]*"/;
  print +($matched xor $expectMatch) ? "FAIL" : "PASS",

With the C shell and derivatives, you have to escape the bang:

% ack-grep '"(?\![^"]*[^"\s]\s+[^"\s])[^"]*[A-Z][^"]*"' ...

I wish I could preserve the highlighted matches, but that doesn't seem to be allowed.

Note that escaped double-quotes (\") will severely confuse this pattern.
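
A small demonstration of that failure mode, as a sketch with a made-up input line: the constant below contains embedded spaces, so by the rules in the question it should be rejected, yet the pattern finds a match because it treats the escaped quotes as delimiters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $pattern = qr/"(?![^"]*[^"\s]\s+[^"\s])[^"]*[A-Z][^"]*"/;

# One string constant containing spaces and two escaped quotes.
# Inside q<...> the backslashes are kept literally.
my $line = q<"say \"Hi\" now">;

if ($line =~ /($pattern)/) {
    print "matched: $1\n";    # the match begins at the first escaped quote
} else {
    print "no match\n";
}
```

The real opening quote is rejected by the lookahead, as intended, but the scan then resumes at the first \" and accepts "Hi\" as if it were a constant of its own.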

OTHER TIPS

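You could add the requirement with a character class, like:

ack --match "\"\s*\S*[A-Z]\S*\s*\""

The \S* on either side of [A-Z] (rather than \S+) lets the upper-case character sit at the start or end of the string, so a single-character constant such as "R" still matches. I'm assuming that ack matches one line at a time. Because \S also matches a double quote, the \S*\s*\" part can match multiple closing quotes in a row: it would match the entirety of "Alfa"", instead of just "Alfa".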
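
One possible tightening, which is an assumption on my part and not something from the answer above, is to use [^"\s] (neither a quote nor whitespace) wherever \S appears, and to allow zero characters on either side of the upper-case letter. The sketch below checks that variant against the examples from the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed variant: [^"\s] in place of \S, so the match cannot
# run across a double quote.
my $variant = qr/"\s*[^"\s]*[A-Z][^"\s]*\s*"/;

my @tests = (
    [ q<"  AString ">  => 1 ],
    [ q<"R">           => 1 ],
    [ q<"!stringBIG?"> => 1 ],
    [ q<"A String">    => 0 ],
    [ q<"Foo bar baz"> => 0 ],
    [ q<"a_string">    => 0 ],
    [ q<"ABigString", "a sentence fragment"> => 1 ],
    [ q<"Alfa"">       => 1 ],   # matches only up to the first closing quote
);

for my $test (@tests) {
    my ($str, $expect) = @$test;
    my $got = ($str =~ $variant) ? 1 : 0;
    printf "%s: %s\n", ($got == $expect ? "PASS" : "FAIL"), $str;
}
```

Like the lookahead version, this does not track which quotes open and which close a string, so text sitting between two constants on the same line can still be mistaken for a constant.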

Licensed under: CC-BY-SA with attribution