[Xml-compile] Ordering causes Reader to fail ?

Anthony Wood bessington at gmail.com
Wed Oct 3 01:07:29 GMT 2012


Hi again

I'm getting the following error when reading one particular ALTO XML file,
whereas other, similar ALTO files are parsed correctly by the same code:

error: element `{http://www.loc.gov/standards/alto/ns-v2#}TextBlock' not processed at {http://www.loc.gov/standards/alto/ns-v2#}alto/Layout/Page/PrintSpace

    use XML::Compile::Schema;
    use XML::Compile::Translate::Reader;
    use XML::Compile::Translate::Template;
    use XML::Compile::Translate::Writer;

    my $schema_in =
        XML::Compile::Schema->new('/cygdrive/h/tony.wood/xml/xsd/alto-v2.0.xsd');
    $schema_in->importDefinitions('xlink.xsd');

    use XML::Compile::Util qw/pack_type/;
    my $type = pack_type 'http://www.loc.gov/standards/alto/ns-v2#', 'alto';

    my $read = $schema_in->compile(READER => $type);

    my $alto_data = $read->( $alto_filespec );    # <-- error in here

I've done a binary chop on the data, and the problem seems to be triggered
by the *ordering* of String / SP elements within a TextLine.

The following is the normal case (99%+) and is parsed OK:

alto => Layout => Page => PrintSpace => TextBlock => TextLine =>
<String/><SP/>...





If *any* TextLine matches the following, the file is rejected with the
error above:

alto => Layout => Page => PrintSpace => TextBlock => TextLine =>
<SP/><String/>...



JHOVE still accepts files containing the case above as well-formed.
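
For reference, a minimal sketch of how the SP-first TextLines could be located without a binary chop. It assumes XML::LibXML is installed (XML::Compile already depends on it). The helper name find_sp_first_lines and the tiny inline sample are made up for illustration; they are not part of my real code or data.

```perl
use strict;
use warnings;
use XML::LibXML;

# Namespace taken from the error message above.
my $ALTO_NS = 'http://www.loc.gov/standards/alto/ns-v2#';

# Hypothetical helper: returns the TextLine nodes whose first child
# element is <SP/> rather than <String/>.
sub find_sp_first_lines {
    my ($doc) = @_;
    my $xpc = XML::LibXML::XPathContext->new($doc);
    $xpc->registerNs(a => $ALTO_NS);
    return $xpc->findnodes('//a:TextLine[*[1][self::a:SP]]');
}

# Trimmed-down sample standing in for a real ALTO file.
my $doc = XML::LibXML->load_xml(string => <<"XML");
<alto xmlns="$ALTO_NS">
  <TextLine HPOS="161"><String CONTENT="Sold"/><SP/><String CONTENT="in"/></TextLine>
  <TextLine HPOS="315"><SP/><String CONTENT="/ill"/></TextLine>
</alto>
XML

my @bad = find_sp_first_lines($doc);
printf "SP-first TextLine with HPOS=%s\n", $_->getAttribute('HPOS') for @bad;
```

For a real file, the document would be loaded with XML::LibXML->load_xml(location => $alto_filespec) instead of the inline string.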


Here are two examples: the first is a typical good case; the second causes
the problem.



<TextLine HEIGHT="29" WIDTH="422" VPOS="838" HPOS="161">
  <String CONTENT="Sold" HEIGHT="24" WIDTH="54" VPOS="842" HPOS="161"/>
  <SP WIDTH="8" VPOS="842" HPOS="216"/>
  <String CONTENT="in" HEIGHT="23" WIDTH="24" VPOS="842" HPOS="225"/>
  <SP WIDTH="7" VPOS="842" HPOS="250"/>
  <String CONTENT="130XC9," HEIGHT="25" WIDTH="78" VPOS="842" HPOS="258"/>
  <SP WIDTH="10" VPOS="844" HPOS="337"/>
  <String CONTENT="Is," HEIGHT="19" WIDTH="28" VPOS="844" HPOS="348"/>
  <SP WIDTH="11" VPOS="846" HPOS="377"/>
  <String CONTENT="and" HEIGHT="23" WIDTH="46" VPOS="840" HPOS="389"/>
  <SP WIDTH="12" VPOS="840" HPOS="436"/>
  <String CONTENT="2a." HEIGHT="21" WIDTH="34" VPOS="842" HPOS="449"/>
  <SP WIDTH="10" VPOS="845" HPOS="484"/>
  <String CONTENT="c*‘i" HEIGHT="16" WIDTH="26" VPOS="845" HPOS="495"/>
  <SP WIDTH="14" VPOS="846" HPOS="522"/>
  <String CONTENT="and" HEIGHT="22" WIDTH="46" VPOS="838" HPOS="537"/>
</TextLine>

<!-- error on next line : -->

<TextLine HEIGHT="29" WIDTH="268" VPOS="866" HPOS="315">
  <SP WIDTH="180" VPOS="873" HPOS="315"/>
  <String CONTENT="/ill" HEIGHT="22" WIDTH="32" VPOS="866" HPOS="496"/>
  <SP WIDTH="9" VPOS="867" HPOS="529"/>
  <String CONTENT="pay" HEIGHT="23" WIDTH="44" VPOS="872" HPOS="539"/>
</TextLine>



The relevant part of the template is:

{ # sequence of Description, Styles, Layout

  # is an unnamed complex
  Layout =>
  { # sequence of Page

    # is an unnamed complex
    # occurs 1 <= # <= unbounded times
    Page =>
    [ { # sequence of TopMargin, LeftMargin, RightMargin,
        #   BottomMargin, PrintSpace

        # is a x0:PageSpaceType
        # is optional
        PrintSpace =>
        { # sequence of choice
          # occurs any number of times
          seq_BlockGroup =>
          [ {
              # choice of TextBlock, Illustration, GraphicalElement,
              #   ComposedBlock

              # is a x0:TextBlockType
              TextBlock =>
              { # sequence of Shape
                # is optional

                # is an unnamed complex
                # occurs 1 <= # <= unbounded times
                TextLine =>
                [ { # sequence of seq_String, HYP

                    # sequence of String, SP
                    # occurs 1 <= # <= unbounded times
                    seq_String =>
                    [ {
                        # is a x0:StringType
                        String =>
                        { # sequence of ALTERNATIVE
                          # is optional

                          # ALTERNATIVE is simple value with attributes
                          # is an unnamed complex
                          # occurs 1 <= # <= unbounded times
                          ALTERNATIVE =>
                          [ { # is a xs:string
                              PURPOSE => "example",

                              # string content of the container
                              # is a xs:string
                              _ => "example", }, ],

                          # is a xs:ID
                          ID => "id_0",

                          # is a xs:IDREFS
                          STYLEREFS => "labels",

                          # is a xs:float
                          HEIGHT => 3.1415,

                          # is a xs:float
                          WIDTH => 3.1415,

                          # is a xs:float
                          HPOS => 3.1415,

                          # is a xs:float
                          VPOS => 3.1415,

                          # is a xs:string
                          # attribute CONTENT is required
                          # white-space preserve
                          CONTENT => "example",

                          # a list of values, where each
                          # is a xs:string
                          # Enum: bold italics smallcaps subscript superscript
                          #    underline
                          STYLE => [ "bold" , ... ],

                          # is a xs:string
                          # Enum: Abbreviation HypPart1 HypPart2
                          SUBS_TYPE => "HypPart1",

                          # is a xs:string
                          SUBS_CONTENT => "example",

                          # is a xs:float
                          # value <= 1
                          # value >= 0
                          WC => 3.1415,

                          # is a xs:string
                          CC => "example", },

                        # is an unnamed complex
                        # is optional
                        SP =3D>
                        { # is a xs:ID
                          ID =3D> "id_0",

                          # is a xs:float
                          WIDTH =3D> 3.1415,

                          # is a xs:float
                          HPOS =3D> 3.1415,

                          # is a xs:float
                          VPOS =3D> 3.1415, }, },
                    ],
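
As far as I can tell from the template, a TextLine is expected to contain one or more (String, optional SP) groups followed by HYP, so a line that starts with SP never matches. Below is a minimal core-Perl sketch of that reading. The helper name fits_textline_model is made up, and treating HYP as optional is my assumption, since the template excerpt stops before the HYP details.

```perl
use strict;
use warnings;

# Assumed content model of TextLine, read off the template above:
# one or more (String, optional SP) groups, then an optional HYP.
# Each element name is followed by a single space before matching.
my $TEXTLINE_MODEL = qr/^(?:String (?:SP )?)+(?:HYP )?$/;

# Hypothetical helper: returns 1 if the child element names, given in
# document order, fit the assumed model, and 0 otherwise.
sub fits_textline_model {
    my @names = @_;
    my $flat = join '', map { "$_ " } @names;
    return $flat =~ $TEXTLINE_MODEL ? 1 : 0;
}

# The good example starts with String; the failing one starts with SP.
print fits_textline_model(qw(String SP String SP String)) ? "ok\n" : "rejected\n";
print fits_textline_model(qw(SP String SP String))        ? "ok\n" : "rejected\n";
```

If this reading is right, the reader is only enforcing the sequence declared in the schema, and the SP-first files do not conform to it.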


Is there a workaround? I have little control over the input files.

Many thanks in advance,
Tony Wood