
load external DOS program and persist in memory

I am using a TCC batch file that repeatedly runs an external DOS program (Swiss File Knife) - many thousands of times. To speed things up, I would like to load this small program into memory so it doesn't have to be re-loaded each time from the disk drive. Is there a way TCC can keep an external program like this in memory to call repeatedly?


I was not familiar with sfk so I did a Google search. It looks like a normal Windows executable (not a DOS program). Assuming your batch script is just running an sfk command, Windows file caching should allow it to reload quite quickly.
Barring that, you'd have to create a TCC plug-in to have it available to TCC "faster". And I put 'faster' in quotes because I'm not sure it would be all that much faster.
Or a RAMdisk. Though as Scott says, Windows file caching ought to give you RAMdisk-like speeds without any special effort on your part.
Thanks very much! My batch file cycles through a list of about 30,000 search terms for each text file it's searching, so I'm looking for even small gains... Is it still worth looking at creating a plug-in? Who would I work with for that? Thanks again!
What does SFK do that you can't do with commands, variables, and functions that are built-into TCC? If you're searching for text in files, TCC's FFIND may help.
SFK lets me search for a term bounded by a custom list of characters, count the number of matches, and format the output in ways I can't easily do with FFIND. I am using a TCC batch file to set everything up and just call SFK for the actual search.

At Charles' suggestion, I set up a RAMdisk (Dataram's RAMDisk) and ran everything in it. No improved performance. Each search term takes about 0.07 seconds to run: 0.07 sec × 30,000 search terms ≈ 35 minutes to search each document, with or without the RAMdisk. Still pretty slow: about 42 articles per day means months of continuous running to process thousands of articles. Maybe the wrong technology... Still open to suggestions. Thanks!
I'd love to make suggestions but it's not very clear exactly what you're doing. Do you know regular expressions? Have you read about TPIPE?
Thanks, Vince -

This is the goal of the line of code that is repeated over and over and over:

In the text file %@NAME[%fn%].txt

1. look for the search term %CURRENTTERMG.
2. but only if this search term is bracketed on each side by one of a number of specified special characters
3. retrieve the entire contents of the first matched line
4. convert adjacent whitespace (tabs or spaces) to one space in the retrieved line contents.
5. append other variables and the output of step 4 as a single new line in C:\temp\0TRASHGENE.csv

This is how I coded this in one daisy-chained SFK command:
sfkx64 xfind "C:\temp\DupTXTs\%@NAME[%fn%].txt" "/[char of \r\n\t\x20\x21\x22\x26\x27\x28\x29\x2B\x2C\x2E\x2f\x3A\x3B\x3C\x3D\x3E\x3F\x5B\x5C\x5D\x5F\x7B\x7D\\]%CURRENTTERMG[char of \r\n\t\x20\x21\x22\x26\x27\x28\x29\x2B\x2C\x2E\x2f\x3A\x3B\x3C\x3D\x3E\x3F\x5B\x5C\x5D\x5F\x7B\x7D\\]/" -firsthit +xed "/[chars of (\t )]/ /" +filter -trim -join +xex "/*[eol]/%@NAME[%fn%].pdf%,%MAING%,%CURRENTTERMG%,%NMBG%,%DUP%,\q[part 1]\q\n/" >> C:\temp\0TRASHGENE.csv
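For reference, steps 1 through 4 above can be sketched in a few lines of Python (a hypothetical standalone rewrite, not the poster's actual setup; the boundary-character class mirrors the `[char of ...]` list in the SFK command, and file handling is simplified):

```python
import re

# Boundary characters that may bracket the term (mirrors the SFK "[char of ...]" class).
BOUNDARY = r"[\r\n\t !\"&'()+,./:;<=>?\[\\\]_{}]"

def first_match_line(path, term):
    """Steps 1-4: return the first line containing `term` bracketed by
    boundary characters, with runs of tabs/spaces collapsed to one space.
    (A term at the very start of a line would need an extra ^ alternative.)"""
    pattern = re.compile(BOUNDARY + re.escape(term) + BOUNDARY)
    with open(path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if pattern.search(line):
                return re.sub(r"[ \t]+", " ", line).strip()
    return None
```

Step 5 is then a one-line append: write the returned string, joined with the other fields, to the CSV file.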

I'm familiar with TPIPE and regular expressions and would be grateful for guidance on how to do this in TCC without having to resort to temp files.
You could do the same with TPIPE (and in a similar fashion) but I doubt it would be any faster. You'd be starting an EXE every time (TCC's internal TPIPE starts TPIPE.EXE). And that's going to act like starting SFKX64 every time, to wit, slow.

Is this the sort of thing others might want to do? Did you look for more task-specific ready-made tools?

And you have 30,000 search terms ... eh? I don't think any EXE, called once per search term, will be efficient. You need a program that, when called once, will do all 30,000 searches.
Vince, thanks very much for the heads-up on TPIPE being a separate program! I don't know the commercial applicability of this pet project. I am searching articles on a particular disease for mentions of specific genes and their many alternate names. This program runs through my article archives and spits out the document name, gene name, the found term, the number of matches, and a snippet of text containing the first match. The closest out-of-the-box product for this search is Mastermind, a comprehensive genomic search engine, but it is not an adequate fit for my needs. I am open to paying a programmer to get this done.
Why are you doing the search terms one by one? This sounds like you want a full-text database index. Or, you just need to write a program that does what you want.
David, you hit the nail on the head. My approach has been inappropriate; the proper solution is a full-text database. I want to automate the process of finding instances of many terms in many documents, count the instances per document, and display the text surrounding the matches so I can weed out documents in which matched terms appear only in lists and bibliography sections. I want to export the output in spreadsheet-compatible format, to feed an existing spreadsheet containing other information about the search terms and external links to online databases and source documents.

I don't know how to write the program to do this. I'm looking into porting the contents of the documents into an XML database, but would much rather not climb a learning curve in an area where I have no expertise just to reinvent the wheel. I've been working on this for over a year, and it's taken this long to be convinced the answer is not just one more line of code away. Again, any suggestions on programs or people would be much appreciated. Thanks again!
If you want a database with full-text indexes, you can try MariaDB. I haven't used the full-text indexes, but it does have them. Of course, you would then have to see if you can do what you want using SQL.
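MariaDB aside, a cheap way to feel out whether a full-text index fits the problem is SQLite's FTS5 module, which most builds of Python's standard sqlite3 include (a minimal sketch with an invented table name and sample data):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE articles USING fts5(name, body)")
con.executemany(
    "INSERT INTO articles (name, body) VALUES (?, ?)",
    [("a1.txt", "the BRCA1 gene was upregulated"),
     ("a2.txt", "no genes of interest mentioned")],
)
# One indexed MATCH per term, instead of re-reading every document per term.
rows = con.execute(
    "SELECT name FROM articles WHERE articles MATCH ?", ("BRCA1",)
).fetchall()
print(rows)  # [('a1.txt',)]
```

Whether the per-document occurrence counts and surrounding text you want can be expressed in SQL on top of this is the part to prototype.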

> I don't know how to write the program to do this.

What part don't you know?

BTM files are great, but I wouldn't use them to write a serious app. What programming language/compiler do you use to write programs? For this project, you'd probably want to make sure it includes a regex library (but obviously I haven't tried to do a detailed design for your project).
@ceaton, it sounds like you have a constant supply of scientific articles in plain text format. Is that right?

I'd like to do some experimenting. Does the set of search strings change (ever, often); is their changing an issue? How long are the search strings? How long are the articles?
Have you tried using Everything to search the Content of files?

From Voidtools:
Does Everything search file contents?
Yes, "Everything" can search file content with the content: search function.
File content is not indexed, searching content is slow.

Would using Everything search be faster, slower, or the same, using your present method?

I think the problem here is not the number of documents to be searched, but rather that, for any given document, there are 30,000 search strings and an occurrence count of each (plus more info) is desired. Reading the file 30,000 times and searching for one string each time is one way. An alternative: read the file once and do all 30,000 searches on each line. I suppose the second of those will be faster, but how much faster I wouldn't venture to guess.
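The second approach (one pass over the file, many searches per line) can be sketched in Python. With 30,000 terms, a dedicated multi-pattern matcher such as Aho-Corasick would be the textbook tool, but even a single alternation regex shows the idea (a hypothetical function, not anyone's actual code):

```python
import re
from collections import Counter

def count_terms(path, terms):
    """One pass over the file, counting occurrences of every term at once."""
    # Longest terms first, so a longer gene name wins over a prefix of it.
    alternation = "|".join(re.escape(t) for t in sorted(terms, key=len, reverse=True))
    pattern = re.compile(alternation)
    counts = Counter()
    with open(path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            for match in pattern.finditer(line):
                counts[match.group(0)] += 1
    return counts
```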
Post the 30,000-term text file to your Dropbox so we can look at ways of extracting the text you require.
The way to do this, if it is a real DOS program (as opposed to, say, a console program), is to load it under a COMMAND.COM from MS-DOS or PC DOS 5.00, or the COMMAND.COM from FreeDOS, or 4DOS. Then load the program into memory, and you can run your BTM under 4DOS.

This is how I run UBASIC, a heavy factorisation program, under Windows NT. Any of the programs named above can be kept resident if one needs a DOS program to interact with TSR programs.


Have you looked at writing an SFK script?
      sfk help chain - how to combine multiple commands
      sfk batch      - run many sfk commands in a script file
      sfk label      - define starting points within a script
      sfk call       - call a sub function at a label
      sfk echo       - print (coloured) text to terminal
      sfk color      - change text color of terminal
      sfk setvar     - put text into an sfk variable
      sfk storetext  - store text in memory for later use
      sfk alias      - create command from other commands
      sfk mkcd       - create command to reenter directory
      sfk sleep      - delay execution for milliseconds
      sfk pause      - wait for user input
      sfk stop       - stop sfk script execution
      sfk tee        - split command output in two streams
      sfk tofile     - save command output to a file
      sfk toterm     - flush command output to terminal
      sfk for        - repeat commands many times
      sfk loop       - repeat execution of all commands
      sfk cd         - change directory within a script
      sfk getcwd     - print the current working directory
      sfk require    - compare version text
      sfk time [-h]  - print current date and time

That way, SFK is loaded into memory, so it doesn't have to be reloaded each time from the disk drive.

That looks promising. One big question is whether an SFK script can loop over the 30,000 search strings, say, read from a 30,000-line file of search strings. If that's possible, it ought to be easy and provide a significant improvement over starting SFK 30,000 times.
Thanks for all your feedback! Based on it, I'm trying a different approach: export the text from multiple PDFs to a single XML file, then use BaseX or another free XML database to grind through the XML of many PDFs in one go. That makes more sense than thrashing through thousands of iterations on thousands of files one term at a time. I've created a .btm that quickly exports text from any number of PDFs to a single XML file. On a test run, 800 PDFs yielded a ~50 MB XML file, which loads easily into BaseX. The next step is the XML query script. If I get it up and running, I'll post the steps. Thanks again, all!
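As one way to sanity-check the combined XML before committing to BaseX, here is a Python sketch that streams it and counts matches per document. It assumes each article is wrapped in an element like <doc name="...">, which is a guess at the structure the .btm emits; adjust the tag and attribute names to whatever it actually produces:

```python
import xml.etree.ElementTree as ET
from collections import Counter

def matches_per_doc(xml_path, term):
    """Stream the combined XML and count occurrences of `term` per <doc>."""
    counts = Counter()
    for _, elem in ET.iterparse(xml_path):   # yields elements on their end tags
        if elem.tag == "doc":
            counts[elem.get("name")] = "".join(elem.itertext()).count(term)
            elem.clear()                     # keep memory flat on a ~50 MB file
    return counts
```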
