HTML conversion

May 30, 2008
205
1
#1
Not sure if this is the right forum for this but...

I sometimes use TCC to download HTML pages from web sites, convert them to text and then parse out some wanted information for further processing/presentation.

TCC has built-in support for the first part (COPY with HTTP support) and the last (various text processing functions), but there is no direct support for converting HTML to (readable) text. I could of course try to parse the HTML file directly, but it's much easier to first convert it to a text format that matches what you see in the browser.

So far I have used the Windows port of the links console browser with the "-dump" option to do the conversion. But it's an old, unmaintained port that does not seem to work on Windows Vista.

Does anyone know of any good HTML-to-text console programs for Windows? Open source and native (not requiring e.g. Python) would be a plus.

Would it perhaps be possible to add something like this in future TCC versions? Seems like it would be a good complement to what's already supported.
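
To make the conversion step concrete, here is a minimal sketch of what "HTML to readable text" involves, written in Python purely for illustration (it is not a TCC feature, and the original poster would prefer a native tool). It uses only the standard-library html.parser module. The names TextExtractor and html_to_text, and the choice of which tags count as block-level, are assumptions made for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style content (illustrative only)."""

    # Assumed set of tags that should start a new line in the text output.
    BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; in text data.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Only keep text that is not inside <script> or <style>.
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string, one non-empty line per block."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace within each line and drop empty lines.
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    print(html_to_text("<h1>Title</h1><p>Fish &amp; chips<br>today</p>"))
```

This is far cruder than what a text browser does (no tables, link lists or character-set handling), but it shows the three things a converter has to do at minimum: skip non-visible content, decode entities, and turn block-level tags into line breaks.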
 
#2
nikbackm wrote:
|| Not sure if this is the right forum for this but...
||
|| I sometimes use TCC to download HTML pages from web sites, convert
|| them to text and then parse out some wanted information for further
|| processing/presentation.
||
|| TCC has builtin support for the first part (COPY with HTTP support)
|| and the last (various text processing functions) but there is no
|| direct support for converting HTML to (readable) text. I could of
|| course try to parse to HTML file directly but it's much easier to
|| first convert it to a text format (Matches what you see in the
|| browser).
||
|| So far I have used the Windows port of the links console browser
|| with the "-dump" option to do the conversion. But it's an old
|| unmaintained port that does not seem to work on Windows Vista.
||
|| Does anyone know of some good html to txt console programs for
|| Windows? Open source and native (not requiring e.g. Python) would be
|| a plus.
||
|| Would it perhaps be possible to add something like this in future
|| TCC versions? Seems like it would be a good complement to what's
|| already supported.

You could write a very simple BTM file, which reads the data file as BINARY
one character at a time. Whenever it finds the "<" (less than) character, it
stops passing the output until it is past the (matching) ">" (greater than)
character. You may also find it convenient to drop space characters at the
beginning of a line and after other space characters. This is based on my
(very rudimentary) understanding that in HTML all tags start with "<", and
tags do not embed other tags. You may want to retain some tags or tag
elements, e.g. <p>xxx</p> and href="URL". If you have access to MS Word, it
can interpret an HTML file and save it for you as a text file.
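
A rough sketch of the approach described above, in Python rather than a BTM file, just to show the logic: copy characters until a "<" is seen, discard everything up to the next ">", then drop leading spaces and collapse repeated spaces. The function name strip_tags is made up for this example. Note the limitations that follow from the simplifying assumption: entities such as &amp; are left undecoded, the contents of script and style elements are passed through as text, and a ">" inside an attribute value or comment ends the tag early.

```python
def strip_tags(html):
    """Naive tag stripper: drop everything from '<' up to the next '>'."""
    out = []
    in_tag = False
    for ch in html:
        if ch == "<":
            in_tag = True           # stop passing output
        elif ch == ">" and in_tag:
            in_tag = False          # resume after the closing '>'
        elif not in_tag:
            out.append(ch)          # ordinary text is copied through
    # Drop spaces at the start of each line, collapse runs of spaces,
    # and discard lines that end up empty.
    lines = (" ".join(line.split()) for line in "".join(out).splitlines())
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    print(strip_tags("<p>Hello <b>world</b></p>"))  # prints: Hello world
```

Retaining selected pieces such as href="URL" would mean buffering the tag text while in_tag is set and inspecting it at the closing ">", instead of discarding it.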
--
HTH, Steve
 
#3
> Not sure if this is the right forum for this but...
>
> Does anyone know of some good html to txt console programs for Windows? Open source and native (not requiring e.g. Python) would be a plus.

Hi,
Using the COM interface of Internet Explorer, you can retrieve text content from the web page. Here's how it is done in Microsoft PowerShell:

Code:
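$ie = New-Object -ComObject InternetExplorer.Application
$ie.Navigate("http://www.google.ca")
# Navigate returns immediately, so wait until the page has finished loading
while ($ie.Busy) { Start-Sleep -Milliseconds 100 }
$ie.Document.Body.InnerText
$ie.Quit()
You could also navigate to a file that you have downloaded, for example: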

Code:
$ie.Navigate("file:///C:/utils/Google.htm")
You can use VBScript, Visual Basic, or any other COM-capable language, to access a COM object. Microsoft PowerShell makes it very easy to access any COM object from the command line. I usually have Microsoft PowerShell active in its own tab for just such tasks.

While it would be nice to have the internal ability to filter HTML to text in TCMD, this feature could be added via a Plugin, if one were so inclined to write said Plugin.

Not sure if this will work on Vista, but you could also try HTMSTRIP from:

http://users.erols.com/waynesof/bruce.htm

This is a 16-bit DOS program, from many years gone by.

Joe
 
May 29, 2008
36
0
#5
> So far I have used the Windows port of the links console browser with the "-dump" option to do the conversion. But it's an old unmaintained port that does not seem to work on Windows Vista.
>
Do you have a recent version of Lynx? I'm using version 2.8.5 with Vista and it
works fine. I use it to retrieve some web pages and "-dump" them to text files.
There are many different Win32 ports of Lynx and the quality is inconsistent.
The 2.8.5 version I'm using came from http://fredlwm.iblogger.org/lynx/ but they
now have a version 2.8.7 which I haven't tried.
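
In a batch file this is a one-line call to Lynx with "-dump" and the output redirected to a file. For anyone driving it from a script instead, a minimal sketch in Python could look like the following. It assumes a lynx executable is on the PATH; the helper names lynx_dump_command and dump_to_file are invented for this example, and only the "-dump" option mentioned above is used.

```python
import subprocess


def lynx_dump_command(source, lynx="lynx"):
    """Build the argument list for a Lynx text dump of a URL or local file."""
    return [lynx, "-dump", source]


def dump_to_file(source, out_path, lynx="lynx"):
    """Run Lynx and write its rendered text output to out_path.

    Raises subprocess.CalledProcessError if Lynx exits with an error,
    and FileNotFoundError if the lynx executable cannot be found.
    """
    result = subprocess.run(
        lynx_dump_command(source, lynx),
        capture_output=True,   # collect stdout instead of printing it
        check=True,            # fail loudly on a non-zero exit code
    )
    with open(out_path, "wb") as f:
        f.write(result.stdout)
```

Passing the arguments as a list avoids quoting problems with URLs that contain "&" or "%", which would otherwise need escaping on a command line.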

Dennis
 

samintz

Scott Mintz
May 20, 2008
1,203
11
Solon, OH, USA
#6
Check out Pure Text by Steve Miller. It allows you to paste just text
from the clipboard. It's extremely handy.

http://www.SteveMiller.net/puretext/

Have you ever copied some text from a web page or a document and then
wanted to paste it as simple text into another application without getting
all the formatting from the original source? PureText makes this simple by
adding a new Windows hot-key (default is WINDOWS+V) that allows you to
paste text to any application without formatting.
After running PureText.exe, you will see a "PT" tray icon appear near the
clock on your task bar. You can click on this icon to remove formatting
from the text that is currently on the clipboard. You can right-click on
the icon to display a menu with more options.

-Scott

 
May 30, 2008
205
1
#7
Thanks for all the answers!

There were some interesting options there. For me, I think perhaps going with Lynx might be easiest with my current batch file set.