[Dbix-class] DBIx::Class field metadata and validation

Sat Jul 30 17:56:50 CEST 2005

On Fri, Jul 29, 2005 at 01:30:12PM -0700, Dan Kubb wrote:
> I've got an idea on how DBIx::Class could handle field metadata
> and validation and I wanted to get some input on it -- I'm not
> really sure what the best approach is, but I wanted to spark some
> discussion anyway.
> 
> Some of frustrations I have with Class::DBI seem to stem from
> the fact that field's metadata isn't stored in a organized way.
> For example there are hacks like accessor_name() that map the
> field name onto a database column name.

People have asked me a few times now why _columns and _primaries are hashrefs
of hashrefs.

This is why.

I figured we'd probably want to store something in there at some point -
relationships are handled this way, and store their f_class, join condition
and attrs in there.

There's no reason though, why you couldn't do (e.g.)

__PACKAGE__->_columns->{foo}->{field_class} = 'DBIx::Class::Type::string';

or, if we end up with an object representing a field,

__PACKAGE__->_columns->{foo}->{field_object} = new DBIx::Class::Type::string;

I'd prefer not to replace the hashref value of the 'foo' key with the
object/class altogether because I'd like several extensions of this type to
be able to co-operate.

> At the same time there's no easy way for me to attach other
> metadata to a specific field, like specifying a default value,
> or the data type, or some other constraints.

I was expecting

__PACKAGE__->_columns->{foo} = { type => 'string', max_length => 30 }

or similar as the "usual way of doing things", although your approach below
is perhaps a better one.

> To try and solve this I've hacked up some basic code thats not
> quite ready yet for posting to the list, but I wanted to outline
> my approach to handling field metadata anyway.. I'd love to get
> some feedback before I get too deeply into this:
> 
>   - Each field is put in its own unique class with the naming
>     convention: $table_class::$field_name, so if I have a table
>     called "contact" specified in the class Local::Model::Contact,
>     the name_first field would be defined in the class
>     Local::Model::Contact::name_first

I'd suggest Local::Model::Contact::_columns::name_first, if you don't
mind (the _ being to separate it from non-auto-generated classes). I'd like
to put this convention in place because I'd like to e.g. support a syntax

__PACKAGE__->add_view('name_count',
  { 'name' => 'name', 'count' => 'COUNT(*)' }, { group_by => 'name' });

which automatically creates a subclass called My::Table::_views::name_count.

You could then retrieve records from the view by doing

$class->search_view('name_count', { ... cond ... });

At some point we may also want a class to hold metadata for the relationship
- this has the interesting possibility of having a smart class that knows
which tables its related to and can provide the join condition etc. either
way. Dunno yet, ideas welcome :)

>   - Field classess can inherit from one or more classes that
>     define the metadata and validation rules for a specific type
>     of data.  For example, there might be a Type class for strings,
>     which provides properties like length_min, length_max, format
>     (for a regex the column values must match), etc.

Very nice. Please also make sure you're able to support aggregates such
as COUNT and MAX, and possibly things like SUBSTR(...) and other exprs
- this may at some point come in handy for presenting the same class interface
after the database has changed (ok, just do it in software, but sometimes the
db is better at such things).

>   - The Type classes use Class::Data::Inheritable to specify the
>     properties they provide.
>
>   - A Field class is defined like so:
> 
>       package Local::Model::Contact::name_first;
> 
>       use base qw/
>         DBIx::Class:::Type::database_column
>         DBIx::Class:::Type::string
>       /;
> 
>       __PACKAGE__->name('name_first');
> 
>       __PACKAGE__->length_min(1);
>       __PACKAGE__->length_max(50);
> 
>       __PACKAGE__->allow_null(0);
> 
>       __PACKAGE__->table('Local::Model::Contact');
>       __PACKAGE__->database_column_name('name_first');
> 
>       1;
> 
>   - Normally typing all this in would be really tedious, so I'm
>     using an approach similar to Sebastian's Class::DBI::Loader
>     where I iterate over all the tables in a database, and create
>     the code like the above for each column.  This would be
>     optional of course.

Nah, naff way of doing it. Intercept ->add_columns (since the compat
interfaces call it in the end anyway) and create the classes as you
go along - plus stash a reference back to them in _columns->{colname}.

You might also consider making add_columns also accept (e.g.)

__PACKAGE__->add_columns({ name_first =>
                              { type => string, length_min => 1, ... }, ... });

It occurs to me that by using a hash we don't store column order. If that
becomes important, remember there's nothing stopping you making
$class->_columns a ref to an IxHash or similar.

>   - You're free to add further properties for things that cannot
>     be inferred from the database in your model code, such as
>     regex patterns that must match, or a human readable label.

Hence why I'd like it included in add_columns :)

>     You could also specify inflate/deflate routines for the
>     column here as well.

I think I'd prefer the inflate/deflate routines to be methods on the type
- but there's no reason you can't provide default inflate/deflate subs that
either do nothing or delegate to a provided subref if available.

Also remember that assuming the type classes are written as MI/NEXT.pm
components, there's no reason you can't compose them -

__PACKAGE__->add_columns({ foo =>
                           { type_class =>
                               [ qw/database_column sub_flate string/ ] } });

which will go away and do

package My::Table::_columns::foo;

use base qw/DBIx::Class::Type::database_column
            DBIx::Class::Type::sub_flate
            DBIx::Class::Type::string/;

...

>   - Each of the DBIx::Class::Type::* classes has a validate()
>     method that checks the data type. The string class'
>     validate() method looks like this:
> 
>       sub validate : method {
>         my $self = shift;
> 
>         return
>           $self->__validate_is_string,
>           $self->__validate_with_length_min,
>           $self->__validate_with_length_max,
>           $self->__validate_with_allowed_chars,
>           $self->__validate_with_disallowed_chars,
>           $self->__validate_with_format,
>           $self->__validate_with_encoding,
>           $self->NEXT::DISTINCT::validate;
>       }

Lovely, although perhaps break this out into a number of composable classes
so that people who don't need checks don't need them - also encoding tends
to be a complex issue (potentially related to inflate/deflate, I suspect)
and you probably want to be able to plug in implementations depending on
what you're doing (and what database you're using).

>     If a validation rule fails, it returns an array of hashrefs
>     of errors. The hashrefs contain a name, a message, and
>     possibly some data to help explain what failed the validation
>     rule if it can't be found in another way.

Yep, definitely the way to go - although you might also want to look at
leaving hooks in case someone wants a DBIx::Class::Error::* object for
either the full error collection or per error.

>     I toyed with the idea of throwing an exception when a rule
>     fails, but I would rather have all the rules execute and know
>     everything thats wrong with the data in one go.  What to do
>     with the errors is punted to the caller. ;)

Yes, although allowing an *option* to short-circuit on the first failure
might be handy.

How to deal with the errors, and probably the Error object creation above
if appropriate, should be handled by the extension that does the validation
work, I'd say.

>   - Fields don't necessarily need to be columns in a database;
>     but they can have most of the same properties.

Yep, which means it should be possible to have fields on an object that
aren't in the db, which would probably be a much nicer solution than
the TEMP column group under cdbi.

>   - Foreign key fields can inherit from the primary key field
>     of the table they link to to show the relationship between
>     them.

Have a look at my comments on relationships above - maybe if you had a
relationship class which linked to all field classes involved and vice
versa? (the inheritance idea is also very cool though - do both :)

> Here's the DBIx::Class::Type::.* classes I've made so far and
> the properties each provides:
> 
> <snip />
>
> So far most of this is working great. Of course it really doesn't
> do much on its own, but it provides a lot of information that
> DBIx::Class (and other classes) can use.

Looks like a pretty good start to me.

> The one thing I'm really not sure about is that I've made it so
> each column value is a blessed scalar in its Field class. That
> way the data is stored, but I can easily get at the property
> values if there are needed.  I'm concerned about performance and
> memory usage though, although it allows you to do nice things
> like this once an object is instantiated:
> 
>   # get the maximum length for the field
>   my $length_max = $obj->name_first->length_max;

Yeah ... I think this behaviour needs to be optional so people can
avoid the overhead if they don't want it. Also: The column value must
*always* be the real value, or at least get_column must return the real
value when called. This allows things to make assumptions and is a performance
gain - look at how I implemented get/set/store_has_a and create
get/set/store_field_object methods (or other name to test), then re-create
the accessors in the field_object group if you want object return for the
accessor methods (D::C::AccessorGroup *will* add accessors for which a method
is already present - this lets you shoot yourself in the foot but is
also substantially more flexible).

> Scalar objects are pretty light-weight so I'm not sure it would
> make much of a difference either way, but I do like the idea of
> keeping all the properties close to the column. Using a simple
> AUTOLOAD would allow pass-through to underlying object calls as
> long as there wasn't a collision with the method names.
> (NOTE: I haven't quite gotten to working with objects yet).

That's fair enough. You might also want to consider storing a weak ref
back to the parent object so you can proxy the value through if reqd.
- that way a value object grabbed from a row object will keep stringifying
to the correct value even if the row is updated. This probably isn't always
what you want but it might at least be useful to have $value_obj->refresh
or similar.

Really, really, really look at the Tangram Expr objects here; I think there's
a common base for such objects that we can factor out and use in both projects.

That probably means you should have a look at the Tangram::Type stuff as well.

> Worst case though, as long as the table knew the class names for
> each column, it could use the validation methods on its own. I
> could handle not making the object a blessed Scalar; but I do
> like the syntax it allows you ;)

Agreed, and if they're under the _column stuffs you should be able to call

$class->_columns->{$col}->{field_class}->validate($value)

> Here's a few nice side effects I can think of with this system:
> 
> Default values:
> 
> We could know what the default value of a field should be without
> having to create a corresponding record in the database first.
> 
> Query optimization:
> 
> If someone performs a search, where the string 'foo' is used for
> an integer column, we can skip going to the database and just
> return no results, since we know there can't be any.
> 
> Likewise if a character column has to match a regex like
> qr/\A[A-Z]+\z/, and must be beween 5 and 10 characters in length,
> and the supplied value is either "BAR" or "foobar" we can also
> return immediately.
> 
> Also, if a field is NOT NULL, but undef is supplied for the
> value, then we can return immediately.

Yes, assuming the conditions involved are 'AND' and not 'OR'. Bear in mind
the existence of DBIx::Class::SQL::Abstract and the probable arrival at
some stage of DBIx::Class::SQL::Routine :)

> Database Table Creation:
> 
> If tables and columns can be described in a rich enough way, then
> it should be possible to make CREATE TABLE statements based on
> the descriptions of the columns in the perl code. You should be
> able to just change the DSN and rerun a script to re-create
> everything with a different database engine.
> 
> This will require a way of richly describing things at the table
> level, but I think it is do-able.

Have a good look at SQL::Routine and its schema support; all the power you
need is, I believe, already there.

> Simplified code:
> 
> I'm still getting my head wrapped around DBIx::Class, but I'm
> pretty sure that when (if) I refactor DBIx::Class to use this
> module some things will become simpler in the code.

I'd like to keep this stuff optional (in the spirit of being able to drop
in whatever features *your* app needs from its ORM) but I'd be very interested
to see a recipe that adds support for this (possibly DBIx::Class::FieldClasses
as the recipe and the extra extensions involved under this namespace).

If you can work up a diff against the trunk I'd be willing to give you
a commit bit to put this into the main dist; if people are happy with the
idea I think this system is probably going to become the standard way to
do things; I was assuming we'd need type classes eventually anway
in order to infer joins and to add the rest of the features you outline
above.

> Filters:
> 
> Once there is a way of handling things at the column level, it
> should be fairly easy to make something that pipes the data
> through a chain of filters that can remove leading/trailing
> whitespace, or properly case words.

Chain 'em by MI/NEXT :)

use base qw/DBIx::Class::TypeFilter::Foo/;

...

package DBIx::Class::TypeFilter::Foo;

sub inflate_filter {
  return shift->NEXT::inflate_filter(frobnicate_fooly(@_));
}

> Anyway, thats all for now...  The main point I want to get across
> is that the level of detail this allows should be optional.  I
> think I can derive the critical information from interface
> DBIx::Class provides right now.  This just provides a way of
> going deeper and specifying the properties each field should
> have at a more granular level.

Agreed entirely. Thanks very much, and I look forward to seeing the code
in trunk at some point :)

-- 
     Matt S Trout           Website: http://www.shadowcatsystems.co.uk
  Technical Director        E-mail:  mst (at) shadowcatsystems.co.uk
Shadowcat Systems Ltd.