[Catalyst] HTML to plain text conversion

Xavier Robin robin0 at etu.unige.ch
Tue Jan 9 17:10:25 GMT 2007


On Monday 08 January 2007 20:19, Peter Karman wrote:
> Xavier Robin scribbled on 1/8/07 11:14 AM:
> > Do you know a (catalyst plugin|perl module|external tool) that converts
> > HTML to plain text? I mean, keeping some formatting (especially lists and
> > links...), not just stripping HTML tags...
>
> I use the w3m tool:
>
>   % w3m -dump file.html > file.txt
>
> I like it because it preserves tables pretty well.

Unfortunately, w3m -dump doesn't print the href attributes of links.
I also tried HTML::Scrubber, as Carl Franks suggested, but it basically just
keeps whichever tags we choose to allow.

In fact, I'm looking for something that converts my HTML file to a plain
text file, with no markup left in it at all.

For example, a link like this:

<a href="http://site.example">A link</a>

would be transformed into something like:

A link
http://site.example
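
To make the transformation above concrete, here is a minimal sketch in
Python. It is not a CPAN module, only an illustration built on the standard
library's html.parser. The names LinkTextExtractor and html_to_text are made
up for this example, and it only handles links; lists and tables would need
extra handling.

```python
from html.parser import HTMLParser


class LinkTextExtractor(HTMLParser):
    """Hypothetical helper: drops all markup but keeps link targets as text."""

    def __init__(self):
        super().__init__()
        self.parts = []   # pieces of the plain-text output
        self.hrefs = []   # href values of the <a> tags currently open

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Remember the target so it can be printed after the link text.
            self.hrefs.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag == "a" and self.hrefs:
            href = self.hrefs.pop()
            if href:
                # Put the URL on its own line, right after the link text.
                self.parts.append("\n" + href + "\n")

    def handle_data(self, data):
        # Text between tags is copied through unchanged.
        self.parts.append(data)


def html_to_text(html):
    """Return the text content of `html`, with each link's URL on the next line."""
    parser = LinkTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


print(html_to_text('<a href="http://site.example">A link</a>'))
```

Run on the example link, this prints "A link" followed by
"http://site.example" on the next line.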

I'm sure a module that does this exists on CPAN.

Thanks,
Xavier
-- 
Some people say that if you play a Windows XP install CD backwards you will
hear demon voices commanding you to worship Satan. But that's nothing. If you 
play it forward it will install Windows XP.


