[Xml-compile] Ordering causes Reader to fail ?
Anthony Wood
bessington at gmail.com
Wed Oct 3 01:07:29 GMT 2012
Hi again
I'm getting the following error when reading one particular ALTO XML file,
whereas other similar ALTO files are parsed correctly with the same code:
error: element `{http://www.loc.gov/standards/alto/ns-v2#}TextBlock' not processed at {http://www.loc.gov/standards/alto/ns-v2#}alto/Layout/Page/PrintSpace
use XML::Compile::Schema;
use XML::Compile::Translate::Reader;
use XML::Compile::Translate::Template;
use XML::Compile::Translate::Writer;
use XML::Compile::Util qw/pack_type/;
# Load the ALTO v2.0 schema plus the xlink definitions it refers to
my $schema_in =
    XML::Compile::Schema->new('/cygdrive/h/tony.wood/xml/xsd/alto-v2.0.xsd');
$schema_in->importDefinitions('xlink.xsd');
# Compile a reader for the root element {ALTO namespace}alto
my $type = pack_type 'http://www.loc.gov/standards/alto/ns-v2#', 'alto';
my $read = $schema_in->compile(READER => $type);
# $alto_filespec holds the path of the ALTO file being read
my $alto_data = $read->( $alto_filespec ); # <-- error in here
I've done a binary chop on the data, and the problem seems to be triggered
by the *ordering* of String / SP elements within a TextLine.
The following is the normal case (99%+) and is parsed OK:
alto => Layout => Page => PrintSpace => TextBlock => TextLine =>
<String/><SP/>...
If there are *any* TextLines which match the following, the file is
rejected with the error above:
alto => Layout => Page => PrintSpace => TextBlock => TextLine =>
<SP/><String/>...
Jhove still accepts files containing the case above as Well Formed.
Here are 2 examples: the first is a typical good case, the second causes
the problem.
<TextLine HEIGHT="29" WIDTH="422" VPOS="838" HPOS="161">
  <String CONTENT="Sold" HEIGHT="24" WIDTH="54" VPOS="842" HPOS="161"/>
  <SP WIDTH="8" VPOS="842" HPOS="216"/>
  <String CONTENT="in" HEIGHT="23" WIDTH="24" VPOS="842" HPOS="225"/>
  <SP WIDTH="7" VPOS="842" HPOS="250"/>
  <String CONTENT="130XC9," HEIGHT="25" WIDTH="78" VPOS="842" HPOS="258"/>
  <SP WIDTH="10" VPOS="844" HPOS="337"/>
  <String CONTENT="Is," HEIGHT="19" WIDTH="28" VPOS="844" HPOS="348"/>
  <SP WIDTH="11" VPOS="846" HPOS="377"/>
  <String CONTENT="and" HEIGHT="23" WIDTH="46" VPOS="840" HPOS="389"/>
  <SP WIDTH="12" VPOS="840" HPOS="436"/>
  <String CONTENT="2a." HEIGHT="21" WIDTH="34" VPOS="842" HPOS="449"/>
  <SP WIDTH="10" VPOS="845" HPOS="484"/>
  <String CONTENT="c*‘i" HEIGHT="16" WIDTH="26" VPOS="845" HPOS="495"/>
  <SP WIDTH="14" VPOS="846" HPOS="522"/>
  <String CONTENT="and" HEIGHT="22" WIDTH="46" VPOS="838" HPOS="537"/>
</TextLine>
<!-- error on next line : -->
<TextLine HEIGHT="29" WIDTH="268" VPOS="866" HPOS="315">
  <SP WIDTH="180" VPOS="873" HPOS="315"/>
  <String CONTENT="/ill" HEIGHT="22" WIDTH="32" VPOS="866" HPOS="496"/>
  <SP WIDTH="9" VPOS="867" HPOS="529"/>
  <String CONTENT="pay" HEIGHT="23" WIDTH="44" VPOS="872" HPOS="539"/>
</TextLine>
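For anyone wanting to find the offending lines without a binary chop, here is a minimal sketch that lists every TextLine whose first child element is an SP. It assumes XML::LibXML is installed (XML::Compile is built on it) and that the file uses the ALTO v2 namespace from the error message. The helper name textlines_starting_with_sp and the trimmed in-memory document are made up for illustration; this only detects the pattern and does not work around it.

```perl
use strict;
use warnings;
use XML::LibXML;

# ALTO v2 namespace, as it appears in the error message.
my $ALTO_NS = 'http://www.loc.gov/standards/alto/ns-v2#';

# Hypothetical helper (not part of XML::Compile): return every TextLine
# element whose first child element is an SP rather than a String.
sub textlines_starting_with_sp {
    my ($doc) = @_;
    my $xpc = XML::LibXML::XPathContext->new($doc);
    $xpc->registerNs(a => $ALTO_NS);
    # *[1] picks the first child element; self::a:SP tests its name.
    return $xpc->findnodes('//a:TextLine[*[1][self::a:SP]]');
}

# Trimmed-down stand-in for a real ALTO file: one good line, one bad line.
my $xml = <<"XML";
<alto xmlns="$ALTO_NS"><Layout><Page><PrintSpace><TextBlock>
<TextLine VPOS="838"><String CONTENT="Sold"/><SP/><String CONTENT="in"/></TextLine>
<TextLine VPOS="866"><SP/><String CONTENT="/ill"/><SP/><String CONTENT="pay"/></TextLine>
</TextBlock></PrintSpace></Page></Layout></alto>
XML

my $doc  = XML::LibXML->load_xml(string => $xml);
my @hits = textlines_starting_with_sp($doc);

# Report each offending TextLine by its VPOS attribute.
printf "TextLine with VPOS=%s starts with SP\n", $_->getAttribute('VPOS')
    for @hits;
```

For a real file, the document would be loaded with XML::LibXML->load_xml(location => $alto_filespec) instead of the inline string.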
The relevant part of the template is:
{ # sequence of Description, Styles, Layout
  # is an unnamed complex
  Layout =>
  { # sequence of Page
    # is an unnamed complex
    # occurs 1 <= # <= unbounded times
    Page =>
    [ { # sequence of TopMargin, LeftMargin, RightMargin,
        # BottomMargin, PrintSpace
        # is a x0:PageSpaceType
        # is optional
        PrintSpace =>
        { # sequence of choice
          # occurs any number of times
          seq_BlockGroup =>
          [ {
              # choice of TextBlock, Illustration, GraphicalElement,
              # ComposedBlock
              # is a x0:TextBlockType
              TextBlock =>
              { # sequence of Shape
                # is optional
                # is an unnamed complex
                # occurs 1 <= # <= unbounded times
                TextLine =>
                [ { # sequence of seq_String, HYP
                    # sequence of String, SP
                    # occurs 1 <= # <= unbounded times
                    seq_String =>
                    [ {
                        # is a x0:StringType
                        String =>
                        { # sequence of ALTERNATIVE
                          # is optional
                          # ALTERNATIVE is simple value with attributes
                          # is an unnamed complex
                          # occurs 1 <= # <= unbounded times
                          ALTERNATIVE =>
                          [ { # is a xs:string
                              PURPOSE => "example",
                              # string content of the container
                              # is a xs:string
                              _ => "example", }, ],
                          # is a xs:ID
                          ID => "id_0",
                          # is a xs:IDREFS
                          STYLEREFS => "labels",
                          # is a xs:float
                          HEIGHT => 3.1415,
                          # is a xs:float
                          WIDTH => 3.1415,
                          # is a xs:float
                          HPOS => 3.1415,
                          # is a xs:float
                          VPOS => 3.1415,
                          # is a xs:string
                          # attribute CONTENT is required
                          # white-space preserve
                          CONTENT => "example",
                          # a list of values, where each
                          # is a xs:string
                          # Enum: bold italics smallcaps subscript superscript
                          # underline
                          STYLE => [ "bold" , ... ],
                          # is a xs:string
                          # Enum: Abbreviation HypPart1 HypPart2
                          SUBS_TYPE => "HypPart1",
                          # is a xs:string
                          SUBS_CONTENT => "example",
                          # is a xs:float
                          # value <= 1
                          # value >= 0
                          WC => 3.1415,
                          # is a xs:string
                          CC => "example", },
                        # is an unnamed complex
                        # is optional
                        SP =>
                        { # is a xs:ID
                          ID => "id_0",
                          # is a xs:float
                          WIDTH => 3.1415,
                          # is a xs:float
                          HPOS => 3.1415,
                          # is a xs:float
                          VPOS => 3.1415, }, },
                    ],
Is there a workaround? I have little control over the input files.
Many thanks in advance,
Tony Wood