Discussion:
Search and Replace text file very slow
jerschmidt14
2009-05-27 12:58:01 UTC
Hello,
I have text files approx 16-20 MB in size. They are flat files with
approx 18-20 thousand records each. Every night I have to search them for
invalid ASCII characters and replace them with spaces. Call it a filter if
you will. What I wrote was

(gc filein.txt) -replace "[^\u0020-\u007F]"," " | sc fileout.txt

Although this seems to do the job, it runs VERY slowly!

I have tried adjusting the batch size by adding -read:

(gc filein.txt -read 5000) -replace "[^\u0020-\u007F]"," " | sc fileout.txt

This runs much faster. However, it then seems to lose the line endings.

Thanks in advance,

Jeremy
Robert Robelo
2009-05-27 17:55:37 UTC
Wrapping the Get-Content statement in an expression ( ) and passing the [String[]] to -replace _is_ a good technique, especially when you want to overwrite the file being read, but it is _not_ recommended for huge files because it hogs lots of RAM and takes forever, if the shell doesn't crash first.

Since you're writing the output to a different file, pipe each String to ForEach-Object, do the replacement, and pipe the new String to Set-Content:

gc filein.txt | % {$_ -replace "[^\u0020-\u007F]"," "} | sc fileout.txt
jerschmidt14
2009-06-02 17:18:01 UTC
Hi Robert,
Thanks for the reply. I did as you mentioned, but PowerShell is still
very slow. If I write the code in Perl, it only takes approx 2 seconds
to run. The PowerShell example takes 10-20 seconds to run (approx 10 times
as long). I imagine this is because PowerShell is processing this on a
line-by-line basis, whereas in Perl I can redirect STDIN, do a tr, then print out.
Would there be any way to process the whole file in a few iterations, making
use of the -read parameter? How about switching the gc mode to binary?
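
A minimal sketch of that -read (-ReadCount) idea, using the file names from above and untested here: with -ReadCount, Get-Content sends arrays of lines down the pipeline instead of single strings, -replace operates on every element of each array, and Set-Content writes one line per element, so the line endings are preserved.

gc filein.txt -ReadCount 1000 | % {$_ -replace "[^\u0020-\u007F]"," "} | sc fileout.txt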

Jeremy

In Perl
jerschmidt14
2009-06-02 17:21:01 UTC
Oops, forgot to attach the Perl

binmode STDIN;
binmode STDOUT;

# tr///c complements the list, so every character NOT in the set
# (printable ASCII \040-\176, plus LF \012 and CR \015) becomes a space.
while (<STDIN>) {
    tr/\040-\176\012\015/ /c;
    print $_;
}
tojo2000
2009-06-02 19:06:29 UTC
You might want to check out System.IO.StreamReader:
http://msdn.microsoft.com/en-us/library/system.io.streamreader.streamreader.aspx
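
A rough sketch of that approach (the paths are placeholders, and this is untested against the files in question). ReadLine() returns $null at end of file; note that .NET resolves relative paths against the process working directory rather than the current PowerShell location, hence the full paths:

$reader = New-Object System.IO.StreamReader 'C:\data\filein.txt'
$writer = New-Object System.IO.StreamWriter 'C:\data\fileout.txt'
while (($line = $reader.ReadLine()) -ne $null) {
    $writer.WriteLine($line -replace '[^\u0020-\u007F]', ' ')
}
$reader.Close()
$writer.Close()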

Also, you could try compiling a regex ahead of time and then using its
Replace() method; it's possible that -replace is recompiling the regex
on every call.
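
For example, a minimal sketch along those lines (file names as in the thread): the 'Compiled' option tells .NET to compile the pattern once up front, and Replace() then reuses it for every line.

$regex = New-Object System.Text.RegularExpressions.Regex '[^\u0020-\u007F]', 'Compiled'
gc filein.txt | % {$regex.Replace($_, ' ')} | sc fileout.txt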
