
HTML conversion

Discussion in 'Support' started by nikbackm, Oct 14, 2009.

  1. nikbackm

    Joined:
    May 30, 2008
    Messages:
    194
    Likes Received:
    1
    Not sure if this is the right forum for this but...

    I sometimes use TCC to download HTML pages from web sites, convert them to text and then parse out some wanted information for further processing/presentation.

    TCC has built-in support for the first part (COPY with HTTP support) and the last (various text processing functions), but there is no direct support for converting HTML to (readable) text. I could of course try to parse the HTML file directly, but it's much easier to first convert it to a text format (which matches what you see in the browser).

    So far I have used the Windows port of the links console browser with the "-dump" option to do the conversion. But it's an old unmaintained port that does not seem to work on Windows Vista.

    Does anyone know of some good html to txt console programs for Windows? Open source and native (not requiring e.g. Python) would be a plus.

    Would it perhaps be possible to add something like this in future TCC versions? Seems like it would be a good complement to what's already supported.
     
  2. Steve Fabian

    Joined:
    May 20, 2008
    Messages:
    3,520
    Likes Received:
    4
    You could write a very simple BTM file, which reads the data file as BINARY
    one character at a time. Whenever it finds the "<" (less than) character, it
    stops passing the output until it is past the (matching) ">" (greater than)
    character. You may also find it convenient to drop space characters at the
    beginning of a line and after other space characters. This is based on my
    (very rudimentary) understanding that in HTML all tags start with "<", and
    tags do not embed other tags. You may want to retain some tags or tag
    elements, e.g. <p>xxx</p> and href="URL". If you have access to MS Word, it
    can interpret an HTML file and save it for you as a text file.
    --
    HTH, Steve
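
The character-by-character approach described above can be sketched as follows. This is a minimal illustration in Python rather than a BTM file, and the function names `strip_tags` and `squeeze_spaces` are hypothetical. It assumes, as the post does, that tags never contain other tags; it does not decode entities such as `&amp;`, and it does not skip comments or the contents of script and style elements.

```python
def strip_tags(html):
    """Drop everything between '<' and the next '>' (naive, no nesting)."""
    out = []
    in_tag = False
    for ch in html:
        if ch == "<":
            in_tag = True          # stop passing output
        elif ch == ">" and in_tag:
            in_tag = False         # resume after the closing '>'
        elif not in_tag:
            out.append(ch)
    return "".join(out)


def squeeze_spaces(text):
    """Collapse runs of whitespace in each line and trim both ends of it."""
    return "\n".join(" ".join(line.split()) for line in text.splitlines())


print(squeeze_spaces(strip_tags("<p>Hello   <b>world</b></p>")))
```

Keeping selected tags or attributes (for example href values) would need extra cases in the loop; for anything beyond simple pages a real HTML parser is likely the safer choice.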
     
  3. Joe Caverly

    Joined:
    Aug 28, 2009
    Messages:
    679
    Likes Received:
    8
    Hi,
    Using the COM interface of Internet Explorer, you can retrieve the text content of a web page. Here's how it is done in Microsoft PowerShell:

    Code:
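    $ie = New-Object -ComObject InternetExplorer.Application
    $ie.Navigate("http://www.google.ca")
    # Navigate returns immediately, so wait until the page has finished loading
    while ($ie.Busy) { Start-Sleep -Milliseconds 100 }
    $ie.Document.Body.InnerText
    $ie.Quit()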
    You could also navigate to a file that you have downloaded, for example:

    Code:
    $ie.Navigate("file:///C:/utils/Google.htm")
    You can use VBScript, Visual Basic, or any other COM-capable language, to access a COM object. Microsoft PowerShell makes it very easy to access any COM object from the command line. I usually have Microsoft PowerShell active in its own tab for just such tasks.

    While it would be nice to have the internal ability to filter HTML to text in TCMD, this feature could be added via a Plugin, if one were so inclined to write said Plugin.

    Not sure if this will work on Vista, but you could also try HTMSTRIP from:

    http://users.erols.com/waynesof/bruce.htm

    This is a 16-bit DOS program, from many years gone by.

    Joe
     
  4. dcantor

    Joined:
    May 29, 2008
    Messages:
    507
    Likes Received:
    3
    You might try using Lynx. (See http://lynx.isc.org/)
     
  5. dbartt

    Joined:
    May 29, 2008
    Messages:
    36
    Likes Received:
    0
    Do you have a recent version of Lynx? I'm using version 2.8.5 with Vista and it
    works fine. I use it to retrieve some web pages and "-dump" them to text files.
    There are many different Win32 ports of Lynx and the quality is inconsistent.
    The 2.8.5 version I'm using came from http://fredlwm.iblogger.org/lynx/ but they
    now have a version 2.8.7 which I haven't tried.

    Dennis
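
For reference, the "-dump" usage mentioned above could be wrapped roughly like this. It is only a sketch: it assumes a `lynx` executable is on the PATH, the helper names `build_lynx_command` and `dump_to_text` are hypothetical, and the optional `-nolist` flag (which omits the numbered link list at the end of the dump) is an addition not mentioned in the thread.

```python
import subprocess


def build_lynx_command(source, with_links=False):
    """Build the argument list for a text dump of a URL or local file."""
    cmd = ["lynx", "-dump"]
    if not with_links:
        cmd.append("-nolist")      # leave out the trailing list of links
    cmd.append(source)
    return cmd


def dump_to_text(source, out_path):
    """Run lynx and write its rendered text output to out_path."""
    result = subprocess.run(build_lynx_command(source),
                            capture_output=True, check=True)
    with open(out_path, "wb") as f:   # bytes: keep whatever encoding lynx emits
        f.write(result.stdout)


# Example (requires lynx to be installed):
# dump_to_text("http://www.google.ca", "google.txt")
```

From a batch file the equivalent is simply redirecting the output of `lynx -dump` to a file, which is probably all an existing TCC script needs.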
     
  6. samintz

    samintz Scott Mintz

    Joined:
    May 20, 2008
    Messages:
    1,187
    Likes Received:
    11
    Check out Pure Text by Steve Miller. It allows you to paste just text
    from the clipboard. It's extremely handy.

    http://www.SteveMiller.net/puretext/

    Have you ever copied some text from a web page or a document and then
    wanted to paste it as simple text into another application without getting
    all the formatting from the original source? PureText makes this simple by
    adding a new Windows hot-key (default is WINDOWS+V) that allows you to
    paste text to any application without formatting.
    After running PureText.exe, you will see a "PT" tray icon appear near the
    clock on your task bar. You can click on this icon to remove formatting
    from the text that is currently on the clipboard. You can right-click on
    the icon to display a menu with more options.

    -Scott

     
  7. nikbackm

    Joined:
    May 30, 2008
    Messages:
    194
    Likes Received:
    1
    Thanks for all the answers!

    There were some interesting options there. For me, I think perhaps going with Lynx might be easiest with my current batch file set.
     
