Discussion:
Search and Replace text file very slow
jerschmidt14
2009-05-27 12:58:01 UTC
Hello,
I have text files approx 16-20 MB in size. They are flat files with
approx 18-20 thousand records each. Every night I have to search them for
invalid ASCII characters and replace them with spaces. Call it a filter if
you will. What I wrote was

(gc filein.txt) -replace "[^\u0020-\u007F]"," " | sc fileout.txt

Although this seems to do the job, it runs VERY slowly!

I have tried adjusting the batch size by adding -read:

(gc filein.txt -read 5000) -replace "[^\u0020-\u007F]"," " | sc fileout.txt

This runs much faster. However, it then seems to lose the line endings.

Thanks in advance,

Jeremy
Robert Robelo
2009-05-27 17:55:37 UTC
Wrapping the Get-Content statement in an expression ( ) and passing the [String[]] to -replace _is_ a good technique, especially when you want to overwrite the file being read, but it is _not_ recommended for huge files because it hogs lots of RAM and takes forever, if the shell doesn't crash first.

Since you're writing the output to a different file, pipe each String to ForEach-Object, do the replacement, and pipe the new String to Set-Content:

gc filein.txt | % {$_ -replace "[^\u0020-\u007F]"," "} | sc fileout.txt
jerschmidt14
2009-06-02 17:18:01 UTC
Hi Robert,
Thanks for the reply. I did as you mentioned, but PowerShell is still
very slow. If I write the code in Perl, it only takes approx 2 seconds
to run. The PowerShell example takes 10-20 seconds to run (approx 10 times
as long). I imagine this is because PowerShell is processing this on a
line-by-line basis, whereas in Perl I can redirect STDIN, do a tr, then print out.
Would there be any way to process the whole file in a few iterations, making
use of the -read parameter? How about switching the gc mode to binary?
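
A minimal sketch of that -read (-ReadCount) idea, using the file names from above and untested here: with -ReadCount, Get-Content sends arrays of lines down the pipeline instead of single strings, -replace operates on every element of each array, and Set-Content writes one line per element, so the line endings are preserved.

gc filein.txt -ReadCount 1000 | % {$_ -replace "[^\u0020-\u007F]"," "} | sc fileout.txt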

Jeremy

In Perl
jerschmidt14
2009-06-02 17:21:01 UTC
Oops, forgot to attach the Perl

binmode STDIN;
binmode STDOUT;

# tr///c complements the list, so every character NOT in the set
# (printable ASCII \040-\176, plus LF \012 and CR \015) becomes a space.
while (<STDIN>) {
    tr/\040-\176\012\015/ /c;
    print $_;
}
tojo2000
2009-06-02 19:06:29 UTC
You might want to check out System.IO.StreamReader:
http://msdn.microsoft.com/en-us/library/system.io.streamreader.streamreader.aspx
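
A rough sketch of that approach (the paths are placeholders, and this is untested against the files in question). ReadLine() returns $null at end of file; note that .NET resolves relative paths against the process working directory rather than the current PowerShell location, hence the full paths:

$reader = New-Object System.IO.StreamReader 'C:\data\filein.txt'
$writer = New-Object System.IO.StreamWriter 'C:\data\fileout.txt'
while (($line = $reader.ReadLine()) -ne $null) {
    $writer.WriteLine($line -replace '[^\u0020-\u007F]', ' ')
}
$reader.Close()
$writer.Close()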

Also, you could try compiling a regex ahead of time and then using its
Replace() method; it's possible that -replace is recompiling the regex
on every call.
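
For example, a minimal sketch along those lines (file names as in the thread): the 'Compiled' option tells .NET to compile the pattern once up front, and Replace() then reuses it for every line.

$regex = New-Object System.Text.RegularExpressions.Regex '[^\u0020-\u007F]', 'Compiled'
gc filein.txt | % {$regex.Replace($_, ' ')} | sc fileout.txt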
