Faster string search

Oct 20, 2017
Netherlands
I have 90,000 files. An external program retrieves one particular string from each file, and I want to determine whether that string contains any name from a list. The function @WILD determines whether the string matches a name.
Before the search starts, the names are loaded into an array:
Code:
SETARRAY /F m[40,1]
SET m[0,0]= 2rebels
SET m[1,0]= a2-type
SET m[2,0]=…
…
The search starts with:
Code:
DO r = 0 TO %@DEC[%@arrayinfo[m,1]]
    IFF %@WILD["%String",*%m[%r,0]*] == 1 THEN
        REM Name found in string
        LEAVE
    ELSE
        REM Name not found in string
    ENDIFF
ENDDO
This search used to run fast and without any problems. In recent months, however, the array has grown to more than 400 names (SETARRAY /F m[443,1]), and the search now takes 2 seconds per string. With 90,000 files to search, the whole run takes more than 2 days.

Question: is there a faster way to search with TCMD?
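
For scale, the figures above can be checked with @EVAL. This is only a quick sketch using the numbers from this post; the 443 is taken from the SETARRAY size, and the first figure is a worst case because LEAVE ends the loop early on a match.

```
rem Worst case: every name is tested against every string
echo %@eval[443 * 90000] @WILD calls
rem Total run time in hours at 2 seconds per string
echo %@eval[90000 * 2 / 3600] hours
```

That is up to 39,870,000 @WILD calls and 50 hours, a little over 2 days.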
 
May 20, 2008
Syracuse, NY, USA
I think it would be quite fast if, instead of calling @WILD more than 400 times, you called @REGEX once. As @REGEX's expression (%bigregex, below), use the disjunction of the strings you're looking for.

Code:
v:\> set bigregex
"string1|string2|string3|string4|string5|string6|string7|string8|string9|string10|string11|string12|str
ing13|string14|string15|string16|string17|string18|string19|string20|string21|string22|string23|string2
4|string25|string26|string27|string28|string29|string30|string31|string32|string33|string34|string35|st
ring36|string37|string38|string39|string40|string41|string42|string43|string44|string45|string46|string
47|string48|string49|string50|string51|string52|string53|string54|string55|string56|string57|string58|s
tring59|string60|string61|string62|string63|string64|string65|string66|string67|string68|string69|strin
g70|string71|string72|string73|string74|string75|string76|string77|string78|string79|string80|string81|
string82|string83|string84|string85|string86|string87|string88|string89|string90|string91|string92|stri
ng93|string94|string95|string96|string97|string98|string99|string100|string101|string102|string103|stri
ng104|string105|string106|string107|string108|string109|string110|string111|string112|string113|string1
14|string115|string116|string117|string118|string119|string120|string121|string122|string123|string124|
string125|string126|string127|string128|string129|string130|string131|string132|string133|string134|str
ing135|string136|string137|string138|string139|string140|string141|string142|string143|string144|string
145|string146|string147|string148|string149|string150|string151|string152|string153|string154|string155
|string156|string157|string158|string159|string160|string161|string162|string163|string164|string165|st
ring166|string167|string168|string169|string170|string171|string172|string173|string174|string175|strin
g176|string177|string178|string179|string180|string181|string182|string183|string184|string185|string18
6|string187|string188|string189|string190|string191|string192|string193|string194|string195|string196|s
tring197|string198|string199|string200|string201|string202|string203|string204|string205|string206|stri
ng207|string208|string209|string210|string211|string212|string213|string214|string215|string216|string2
17|string218|string219|string220|string221|string222|string223|string224|string225|string226|string227|
string228|string229|string230|string231|string232|string233|string234|string235|string236|string237|str
ing238|string239|string240|string241|string242|string243|string244|string245|string246|string247|string
248|string249|string250|string251|string252|string253|string254|string255|string256|string257|string258
|string259|string260|string261|string262|string263|string264|string265|string266|string267|string268|st
ring269|string270|string271|string272|string273|string274|string275|string276|string277|string278|strin
g279|string280|string281|string282|string283|string284|string285|string286|string287|string288|string28
9|string290|string291|string292|string293|string294|string295|string296|string297|string298|string299|s
tring300|string301|string302|string303|string304|string305|string306|string307|string308|string309|stri
ng310|string311|string312|string313|string314|string315|string316|string317|string318|string319|string3
20|string321|string322|string323|string324|string325|string326|string327|string328|string329|string330|
string331|string332|string333|string334|string335|string336|string337|string338|string339|string340|str
ing341|string342|string343|string344|string345|string346|string347|string348|string349|string350|string
351|string352|string353|string354|string355|string356|string357|string358|string359|string360|string361
|string362|string363|string364|string365|string366|string367|string368|string369|string370|string371|st
ring372|string373|string374|string375|string376|string377|string378|string379|string380|string381|strin
g382|string383|string384|string385|string386|string387|string388|string389|string390|string391|string39
2|string393|string394|string395|string396|string397|string398|string399|string400"

v:\> echo %@regex[%bigregex,string18]
1

v:\> echo %@regex[%bigregex,string018]
0
Note: KEYSTACK helped me build the SET command which set BIGREGEX.

You can also look for those strings in files. Below, I looked for all 400 strings in 12708 files (all TXT files in C:) in under 12 seconds!
Code:
v:\> (do f in /d"c:\" /s *.txt ( ffind /k /m /v /e%bigregex "%f" )) 2>NUL
             Format-XML -strings string1, string2, string3
           Format-XML -strings string1, string2, string3
             Format-XML -strings string1, string2, string3
           Format-XML -strings string1, string2, string3

v:\> echo %_do_loop
12708
 
Oct 20, 2017
Netherlands
Dear Scott,

First, thank you for looking for answers.

TPIPE: Take Command Help v24 says TPIPE is substantially slower than reading from and writing to files.
FFIND /E: Multiple names could be stored in the file. The external program creates a structured output with each line starting with a unique keyword. I use FFIND /E to search for the line with the unique keyword to retrieve that line. Next, I have to clean the "dangerous characters" in this line with the SafeChars plugin.

Possible solution: In a DO loop, I could have FFIND /E"keyword ...%m[%r,0]" search the output file and check whether it sets the %_ffind_matches internal variable.

I have to think about that.

Marcel
 
Oct 20, 2017
Netherlands
Dear Vince,

Thank you.

That is a good approach: executing one search with 400 search words.

I have to rebuild and test my batch file. The use of KEYSTACK to build the SET command is new to me.

Marcel
 
May 20, 2008
Syracuse, NY, USA
Marcel wrote: "The use of KEYSTACK to build the SET command is new to me."
It's new to me too. But I wasn't going to type all that. Here's the command I used in one TCC while building the command line in another TCC.
Code:
delay 5 & do i=1 to 400 ( wait .5s &  keystack "string%i|" )
During the delay I switched to the one building the command line. But editing that command_line_in_progress was a bear. Every insertion/deletion of a character near the beginning of the line took about 5 seconds (as the whole rest of the line was redrawn).

It probably would have been easier to do this but I didn't think of it at the time.
Code:
v:\> set bigregex="string1

v:\> setdos /x-5

v:\> do i=2 to 400 ( set bigregex=%[bigregex]^|string%i )

v:\> set bigregex=%[bigregex]"

v:\> setdos /x+5
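
Applied to the array from the first post, the same loop might look like the sketch below. It is not tested, and it rests on several assumptions: the array m is already filled, %String holds the string to test, none of the names contains a regex metacharacter (such as . + ( or [), and @REGEX treats upper and lower case the way the search needs. @REGEX is then called once per string, instead of @WILD once per name.

```
rem Sketch only: build one alternation from the names already loaded in m
set bigregex="%m[0,0]
setdos /x-5
do r = 1 to %@dec[%@arrayinfo[m,1]] ( set bigregex=%[bigregex]^|%m[%r,0] )
set bigregex=%[bigregex]"
setdos /x+5

rem One @REGEX call per string replaces the DO loop over all the names
iff %@regex[%bigregex,"%String"] == 1 then
    rem Name found in string
else
    rem Name not found in string
endiff
```

The expression only has to be built once, before the loop over the 90,000 files starts.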
 
Aug 3, 2016
Netherlands
- Create a text file SearchThis.txt with this content:

Code:
2rebels
a2-type
…
Search with:

Code:
echo %string%|findstr.exe /i /g:SearchThis.txt >nul && echo String exists || echo String not found
- or -
Code:
unset StringExist
echo %string%|findstr.exe /i /g:SearchThis.txt >nul && set StringExist=1

IFF %StringExist% == 1 THEN
..
ELSE
...
ENDIFF
unset StringExist
(Not tested; just for inspiration)
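
By default findstr.exe treats search strings as regular expressions, so a name with a dot in it could match more than intended. A variant with /l, which makes findstr use the search strings literally, might be safer (also not tested; SearchThis.txt and %string% are the same as above):

```
rem /l = use each search string from SearchThis.txt literally, /i = ignore case
echo %string%|findstr.exe /l /i /g:SearchThis.txt >nul && echo String exists || echo String not found
```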
 
Oct 20, 2017
Netherlands
Thank you all for your ideas. I'm working on the implementation.

Meanwhile, a simple function change boosted the execution speed by 80%:
I swapped the @WILD function for the @INDEX function and my batch file is flying again.
 
Oct 20, 2017
Netherlands
Sorry, fake news. An error made me happy for 30 minutes. :confused:
Actually, @INDEX is 10% slower than @WILD, and has the well-known comma-problem in string2.