HTML conversion

nikbackm · Oct 14, 2009

Not sure if this is the right forum for this but...

I sometimes use TCC to download HTML pages from web sites, convert them to text and then parse out some wanted information for further processing/presentation.

TCC has built-in support for the first part (COPY with HTTP support) and the last (various text processing functions), but there is no direct support for converting HTML to (readable) text. I could of course try to parse the HTML file directly, but it's much easier to first convert it to a text format (one that matches what you see in the browser).

So far I have used the Windows port of the links console browser with the "-dump" option to do the conversion. But it's an old unmaintained port that does not seem to work on Windows Vista.

Does anyone know of some good html to txt console programs for Windows? Open source and native (not requiring e.g. Python) would be a plus.

Would it perhaps be possible to add something like this in future TCC versions? Seems like it would be a good complement to what's already supported.

Steve Fabian · Oct 14, 2009

nikbackm wrote:
|| Not sure if this is the right forum for this but...
||
|| I sometimes use TCC to download HTML pages from web sites, convert
|| them to text and then parse out some wanted information for further
|| processing/presentation.
||
|| TCC has builtin support for the first part (COPY with HTTP support)
|| and the last (various text processing functions) but there is no
|| direct support for converting HTML to (readable) text. I could of
|| course try to parse to HTML file directly but it's much easier to
|| first convert it to a text format (Matches what you see in the
|| browser).
||
|| So far I have used the Windows port of the links console browser
|| with the "-dump" option to do the conversion. But it's an old
|| unmaintained port that does not seem to work on Windows Vista.
||
|| Does anyone know of some good html to txt console programs for
|| Windows? Open source and native (not requiring e.g. Python) would be
|| a plus.
||
|| Would it perhaps be possible to add something like this in future
|| TCC versions? Seems like it would be a good complement to what's
|| already supported.

You could write a very simple BTM file, which reads the data file as BINARY
one character at a time. Whenever it finds the "<" (less than) character, it
stops passing the output until it is past the (matching) ">" (greater than)
character. You may also find it convenient to drop space characters at the
beginning of a line and after other space characters. This is based on my
(very rudimentary) understanding that in HTML all tags start with "<", and
tags do not embed other tags. You may want to retain some tags or tag
elements, e.g. <p>xxx</p> and href="URL". If you have access to MS Word, it
can interpret an HTML file and save it for you as a text file.
--
HTH, Steve
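
A minimal sketch of the scan described above, written in Python purely for illustration (the post itself suggests a BTM file, and the same loop should carry over). The function name `strip_tags` is hypothetical. It assumes, as the post does, that every tag starts with "<", ends at the next ">", and that tags do not nest.

```python
def strip_tags(html: str) -> str:
    """Drop everything between '<' and the next '>', then tidy whitespace."""
    out = []
    in_tag = False
    for ch in html:
        if ch == "<":
            in_tag = True            # stop passing output
        elif ch == ">" and in_tag:
            in_tag = False           # resume after the closing bracket
        elif not in_tag:
            out.append(ch)           # ordinary text passes through
    text = "".join(out)
    # Trim each line and collapse runs of whitespace into a single space.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(lines)


if __name__ == "__main__":
    sample = '<p>Hello   <b>world</b></p>\n  <a href="x">link</a>'
    print(strip_tags(sample))
```

This is only as good as the assumption behind it: comments, the contents of script and style blocks, and entities such as `&amp;` are not handled, so the result will usually be rougher than what a browser's text dump produces.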

Joe Caverly · Oct 14, 2009

nikbackm said:
Not sure if this is the right forum for this but...

Does anyone know of some good html to txt console programs for Windows? Open source and native (not requiring e.g. Python) would be a plus.

Hi,
Using the COM interface of Internet Explorer, you can retrieve text content from the web page. Here's how it is done in Microsoft PowerShell:

Code:

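$ie = New-Object -ComObject InternetExplorer.Application
$ie.Navigate("http://www.google.ca")
# Wait until the page has finished loading before reading the document
while ($ie.Busy) { Start-Sleep -Milliseconds 100 }
$ie.Document.Body.InnerText
$ie.Quit()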

You could also navigate to a file that you have downloaded, for example:

Code:

$ie.Navigate("file:///C:/utils/Google.htm")

You can use VBScript, Visual Basic, or any other COM-capable language, to access a COM object. Microsoft PowerShell makes it very easy to access any COM object from the command line. I usually have Microsoft PowerShell active in its own tab for just such tasks.

While it would be nice to have the internal ability to filter HTML to text in TCMD, this feature could be added via a Plugin, if one were so inclined to write said Plugin.

Not sure if this will work on Vista, but you could also try HTMSTRIP from:

http://users.erols.com/waynesof/bruce.htm

This is a 16-bit DOS program, from many years gone by.

Joe

dcantor · Oct 14, 2009

nikbackm said:
Not sure if this is the right forum for this but...

Does anyone know of some good html to txt console programs for Windows? Open source and native (not requiring e.g. Python) would be a plus.

You might try using Lynx. (See http://lynx.isc.org/)

dbartt · Oct 14, 2009

> So far I have used the Windows port of the links console browser with the "-dump" option to do the conversion. But it's an old unmaintained port that does not seem to work on Windows Vista.
>

Do you have a recent version of Lynx? I'm using version 2.8.5 with Vista and it
works fine. I use it to retrieve some web pages and "-dump" them to text files.
There are many different Win32 ports of Lynx and the quality is inconsistent.
The 2.8.5 version I'm using came from http://fredlwm.iblogger.org/lynx/ but they
now have a version 2.8.7 which I haven't tried.

Dennis

samintz · Oct 14, 2009

Check out Pure Text by Steve Miller. It allows you to paste just text
from the clipboard. It's extremely handy.

http://www.SteveMiller.net/puretext/

Have you ever copied some text from a web page or a document and then
wanted to paste it as simple text into another application without getting
all the formatting from the original source? PureText makes this simple by
adding a new Windows hot-key (default is WINDOWS+V) that allows you to
paste text to any application without formatting.
After running PureText.exe, you will see a "PT" tray icon appear near the
clock on your task bar. You can click on this icon to remove formatting
from the text that is currently on the clipboard. You can right-click on
the icon to display a menu with more options.

-Scott

nikbackm · Oct 19, 2009

Thanks for all the answers!

There were some interesting options there. For me, I think perhaps going with Lynx might be easiest with my current batch file set.
