Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Algebraic Data Types #1448

Closed
7 tasks done
yannham opened this issue Jul 12, 2023 · 13 comments
Closed
7 tasks done

Algebraic Data Types #1448

yannham opened this issue Jul 12, 2023 · 13 comments
Assignees
Labels
question Further information is requested type: feature request

Comments

@yannham
Copy link
Member

yannham commented Jul 12, 2023

Algebraic Data Types

Algebraic data types (ADTs) - also called tagged unions - and the ability to pattern match on them is a common and very well appreciated feature of most typed functional languages. If record types can be understood as product types, that is a logical and ({n : Number, s : String} is a string and a number together), ADTs are sum types, corresponding to or. Having both of them allow to represent most common shapes of data.

Nickel enums are a very limited form of tagged unions, which can't hold any data, but have only a tag.

Motivation

ADTs are widely useful, so it's difficult to write a precise and exhaustive list of their applications, but we missed them in many occasions while working on Nickel code:

  • Represent nullable values (especially in statically typed code), although this one might be better special cased in the language as a bare (untagged) union
  • Representing errors (Rust/OCaml/Haskell's Result)
  • Writing folds. Nickel doesn't have a looping construct, hence iteration is performed using fold functions such as std.array.fold. The "state" of the loop is passed along, and it's been often the case that this state is naturally an ADT.
  • Better represent the possible return values of some stdlib functions, and more generally user-defined helper functions. Typically, std.string.find returns {matched = "", index = -1, groups = []} if the pattern isn't found in the string, which is not very elegant.

Representation

Currently, all Nickel primitive values (numbers, strings, booleans, arrays and records) directly map to an equivalent in the JSON data model. This is deliberate as JSON interop is of high importance.

This property isn't true for ADTs. Take a hypothetical Nickel ADT: type Either = 'Left Number | 'Right String. A reasonable encoding in JSON could be something like a record {tag, value} where tag : [| 'Left, 'Right |] and value is either of type Number or String, depending on the tag.

This is in fact how we've encoded tagged union in Nickel in practice, up to now (of course, by doing it manually, we don't have any pattern matching capability and we can represent many invalid values).

This encoding isn't unique. For once, unions are used in the wild in configurations, but they are not necessarily tagged: a more natural and direct representation of Either could be just one value that is either a string or a number, corresponding to the contract:

Either = std.contract.from_predicate
  (fun value => std.is_number value || std.is_string value)

Second, even if the tag+value encoding is used, the choice of field names is arbitrary. We could have chosen discriminant and data for example.

Finally, ADTs usually allow multiple arguments, for example 'Binop Ast Ast, which are harder to translate using a fixed scheme. Although I advocate for having 1-ary ADTs at most and use records to represent multiple arguments anyway.

In conclusion, there doesn't seem to be a canonical encoding of ADTs into the JSON data model. Because unions (tagged or not) are used in the wild, this means in particular that choosing an arbitrary encoding could hurt interop with existing configurations.

Possible solutions

Union types

The first idea we explored a long time ago was to offer a somehow lower-level construct, (untagged) union types. Together with some magic around branching (flow typing), one can achieve a similar level of expressivity and ergonomics as with ADTs. This is the approach of TypeScript. Bare unions also fit well in a structural type system like the one of Nickel.

By writing a union, the user specifies the encoding at the same time, relieving Nickel from doing an arbitrary choice.

Nickel's gradual typing discipline requires that we derive a contract from any static type, and thus that we can derive general union contracts. Alas, union contracts are problematic and rather incompatible with lazyness1.

Dependent records

After the union contracts dead-end, we started to explore the route of adding a very limited form of depending record types, that would allow to express the encoding with a tag and a value, still in the spirit of letting the user specify the representation together with the type while providing ADT-like capabilities.

This work was stopped when @teofr left and haven't been worked on since. Dependent record types might still be a valid solution, but even restricted forms of dependent types have non trivial consequence on the type system and the implementation of the typechecker.

Pick an arbitrary encoding

We could bite the bullet and choose an arbitrary encoding, in spite of the non-canonicity.

Don't serialize ADTs

Another possible stance is to look back at our motivation section, and realize that most of the time, we need ADTs to model internal values that are passed to and return from helper functions. Among the experiments done by the Nickel team, all use-cases where we needed ADTs were values that weren't exported nor imported (this is not true of the nullable use-case, but this one probably deserve an ad hoc type anyway, because it's so pervasive and because we don't want to serialize it as a tagged union but as a bare union).

If ADTs are mostly used internally, the argument around (de)serializing them is less important. So, as for functions, we could simply make ADTs non serializable. Being able to (de)serialize unions is then a separate issue, that can be handled purely on the contract side (using proposals of limited union contracts from the Union and Intersection Contracts are Hard paper for example).

If users really need to export an ADT value, the export_with metadata proposal would help (the idea is to be able to provide a normalization/formatting/post-processing function called before export), as the user could provide a function from their ADT to a serializable value. Deserialization can be achieved simply through contracts, or we could have a dual of import_with.

Proposal

We showed that union types aren't a viable approach in Nickel (it's not technically impossible, but the cost of doing so is just too high on many levels).

Dependent record types are an interesting direction, but with a lot of uncertainty around both the feasibility and the ergonomics (I would dare to say random programmers aren't really familiar with dependent types yet).

Thus, I'm partial to the two last propositions, which aren't very different in the end. If we implement export_with, or any kind of equivalent metadata, the user can control the export if needed, so it might be ok to chose an arbitrary encoding by default.

Or we can pick the last proposition (don't serialize ADTs), but provide a std.enum.encode function as for a default encoder, always forcing the user to explicitly specify an encoding.

Structural vs nominal

Most ADTs in the wild (in other languages) are nominal. But Nickel's type system is structural. Thus, structural ADTs could make more sense: one example is OCaml's polymorphic variants.

On the other hand, polymorphic variants require subtyping (including in the source language) to be used as ADTs, which is arguably more complex to implement and to use.

Syntax

A second question is how ADTs would fit in the existing syntax. This section assumes that we go with proper tagged unions (not bare unions or dependent records).

We already have enum types, so it would make sense to enrich them to accept data inside the variants:

{
  MyEitherType = [| 'Right Number, 'Left String |],
}

This syntax would fit particularly well with structural ADTs. If we go with nominal ADTs, we probably need a separate syntax, as the typing would be quite different (typically, an ADT couldn't have a row tail). | is usually picked to delimit alternatives in many functional languages, but it's already used a lot in Nickel for metadata. The separator , might be an appealing alternative:

{
  MyEitherType = data {
    Right Number, # or 'Right Number
    Left String,
  }
  # or
  data MyEitherType = {
    Right Number, # or 'Right Number
    Left String,
  }
}

This last example doesn't look particularly good, it's just fuel to start a discussion. Feel free to add your own suggestion for a syntax for nominal ADTs!

Roadmap

Preview Give feedback

Footnotes

  1. we wrote a paper about the issue of union contracts in a lazy language.

@yannham yannham added question Further information is requested type: feature request labels Jul 12, 2023
@aspiwack
Copy link
Member

I don't think it's wise to use an arbitrary encoding. At least not at this time. Maybe non-serialisable algebraic types is the reasonable step forward. And it's forward compatible with fixing a serialisation for algebraic types in the future.

@vkleen
Copy link
Contributor

vkleen commented Jul 21, 2023

I also prefer non-serialisable ADTs for now. The export_with metadata would only be sugar for calling a function on the ADT to produce a serialisable encoding, so this proposal shouldn't depend on it.

@yannham
Copy link
Member Author

yannham commented Jul 21, 2023

I believe export_with is a bit more than sugar for function application, because it would be specified at the contract level and would be automatically applied to the end result of merging a field. As a user, you wouldn't have to care about using ADTs values or not: the contract takes care of post processing. I don't think we can simulate it adequately in all generality currently (yes, we can require user to convert all their ADT values before serializing them, but it's a bit different; we can have the contract do that, but that means it changes the representation, so for example you can't pattern match on it anymore).

That being said, I don't think this proposal must depend on it either, but having an export_with would make the consideration of serializing ADTs trivial.

@yannham
Copy link
Member Author

yannham commented Jul 21, 2023

For the syntax, another thing that came to mind: I like the anonymous struct/record feature of Rust and OCaml, that is, you can use a record type inside an alternative and give names to the parameters of your variant. It's great for pattern matching, it's not order-dependent, etc. now I almost use them exclusively, except maybe for variants with one argument.

We could enforce that for Nickel by default, having something like 'Right { num : Number}. This is currently valid syntax but is nonsensical (applying a constant to a record type), so it's probably fine to use.

Additionally, ADTs will surely have to be exported by libraries, so separately from the purely type/syntax side, we might want to start thinking about how to define, export and import user-defined types (cf #422)

@Radvendii
Copy link
Contributor

Radvendii commented Jul 31, 2023

I've got an idea to make union types possible. I'm not necessarily advocating this, just putting it out there.

Basically, the idea is to take only what we need from tagged unions, but still be isomorphic to untagged unions. You have to tag your value, but only with the type that it is. (I'm making up ||| and <<< as a syntax for union types and "is of variant" respectively)

let NumberOrString = Number ||| String in
{ foo : NumberOrString = 5 <<< Number }

Perhaps (hopefully!) this could be inferred for types that can be painlessly unioned (like Number and String), so the real use-case would be something like this:

let module = { 
  Left = { a : Number, b : String },
  Right = { a : Number, b : Number },
  RecordUnion = Left ||| Right,
} in
{ 
  foo : module.RecordUnion 
      = { a = 5, b = 3 } <<< module.Right 
}

For this to be practical, we need to be able to bind types to variables, which as I recall is on the todo list but hasn't been implemented yet.

@yannham
Copy link
Member Author

yannham commented Aug 7, 2023

@Radvendii: I wonder if this solution would differ from having native ADTs and serializing them as untagged unions? Because it seems you need to indicate which part of the union your take, so constructors are the same as for a tagged union, at first sight. But I'm not sure to understand completely, though. How do you match on such values? What would be the corresponding runtime contract?

@Radvendii
Copy link
Contributor

Radvendii commented Aug 8, 2023

I wonder if this solution would differ from having native ADTs and serializing them as untagged unions?

The difference is we guarantee that no information is lost in that translation. If I define the ADT (in haskell notation)

data Foo = Opt1 Int | Opt2 Int

and then try to serialize it as an untagged union, those two options get merged and we lose information. The idea that I have is to make such an ADT unrepresentable. since values are tagged by type. It looks like a tagged union but behaves like an untagged union. Trying to construct a tagged union using the notation I used above would look like:

let module = { 
  Left = Number,
  Right = Number,
  TaggedUnion = Left ||| Right,
} in
{ 
  foo : TaggedUnion 
      = 5 <<< module.Right 
}

But this foo is indistinguishable from foo : TaggedUnion = 5 <<< module.Left. They're both equivalent to foo : TaggedUnion = 5 <<< Number.

module.Left and module.Right in my example are just variables referring to the types. They get evaluated out. This could lead to confusion that some well-placed warnings would help with. Basically any time someone tries to make a union with the same type more than once. Maybe that should even be an error, but there are probably use cases where you want to union a type with a generic other type, and then you might want to allow it.

How do you match on such values?

Good question. Presumably we'd need a construct that allows us to match on the different types a union could be. I don't see off the top of my head what would make that impossible. It would look like matching on a tagged union, but as before the tags are actually just types.

What would be the corresponding runtime contract?

You're asking one of two things. I'm not sure which so I'll answer both:

  1. What does it look like to apply a Union type as a contract (i.e. foo | module.RecordUnion)

    For a value to pass that contract, it has to be tagged with a type. Trying to write

    foo | module.RecordUnion = { a = 5, b = 3}
    

    would fail the contract. If a value is tagged with a type, then the contract would in essence be the same as the contract for that type, but it would return a value of the union type.

  2. What does it look like to build a Union contract from contracts (i.e. SomeContract ||| OtherContract)
    In this case, the contracts have to be treated as flat types I think. So it does indeed start looking more like a tagged union. In theory you could wrap the same type in multiple contracts and simulate a tagged union. Serializing this would indeed lose information. This might be to our advantage though. It gives people an escape hatch if they really want tagged unions, but makes it awkward enough that having an awkward default serialization doesn't come as a surprise.

@Radvendii
Copy link
Contributor

Side note: I have a hope in the back of my mind that the <<< notation will be redundant and we can just re-use | and infer the tag from that. But I haven't fully thought through how that would work so I'm continuing to use the explicit <<< for now

@yannham
Copy link
Member Author

yannham commented Aug 8, 2023

The difference is we guarantee that no information is lost in that translation. If I define the ADT (in haskell notation)

But this foo is indistinguishable from foo : TaggedUnion = 5 <<< module.Left. They're both equivalent to foo : TaggedUnion = 5 <<< Number.

I feel like those two sentences are somehow contradictory. What would the "double Number" example be serialized to? If it's serialized as an untagged union, then don't you lose information as well? If both values are indistinguishable, how do you match on them? Or would you reject such an example?

or a value to pass that contract, it has to be tagged with a type. Trying to write
foo | module.RecordUnion = { a = 5, b = 3}
would fail the contract. If a value is tagged with a type, then the contract would in essence be the same as the contract for that type, but it would return a value of the union type.

Then they're not really untagged unions either. I have a hard time understanding, in practice, what do you mean by "tagged by type" and how does it differ in practice from tagged by an ADT constructor, if you need to specify this tag when you build a value. I guess my general question is: what issue would this proposal solve in practice, for example when compared to the ADT + serialize_with solution?

@Radvendii
Copy link
Contributor

Radvendii commented Aug 8, 2023

Or would you reject such an example?

Yes. Likely it would not be allowed. If it were allowed, it would be isomorphic to Number. You would not be able to match on them separately. This is not something that anyone would do on purpose (except maybe as a result of some application of a generic OrNumber type constructor). Mathematically this is an untagged union type.

what issue would this proposal solve in practice, for example when compared to the ADT + serialize_with solution?

For the moment I'm tagging everything explicitly using <<< for clarity. But much of the time, this is unnecessary since nickel should be able to infer what that should be. The problem I'm trying to solve is starting with your paragraph on "Untagged Unions" In the original post, and trying to solve the problems that make that not possible.

@teofr
Copy link
Contributor

teofr commented Aug 13, 2023

Just a few thoughts, but I've been so disconnected with the project that they shouldn't be taken too seriously.

I think calling tagged unions ADTs is a bit confusing, at least for me ADTs is what you get from combining sum and product types. Some languages like Rust have enum types that give you all, but I don't think that's the best way to understand them.

@Radvendii I can't say I completely understood your idea, but it seems very similar to C++'s any and variant. As Yann mentioned on the first message Nickel's structural type system may get something like this to work a bit harder (I think).

I agree that going for the most restrictive (full on ADTs with pattern matching but without serialization) is a good first approach. It also seems that all you really want are contracts over alternatives (union or dependent records are too complex) and pattern matching, have you considered having a discriminated union type/contract, where you would specify which field is the discrimator, and impose some restrictions on the type of the discrimator (it should be flat, maybe just allow enums) and on the fact that each case of the union has to have a unique discriminator. Something like:

{
  MyEitherType = [tag| {tag: 'Right,  value: Number}, {tag:'Left,  value1: String, value2: String} |],
 #                             ^ in here you point the discriminator field
}

And then you could (maybe easily) create contracts and pattern matching for this case:

{
  match x over MyEitherType:
    {tag: 'Right, value} => ...
    {tag: 'Left, value1, value2} => ...
}

@Radvendii
Copy link
Contributor

I don't want to completely dominate this issue discussion, so I tried to write up my idea more clearly in another issue: #1533

@yannham
Copy link
Member Author

yannham commented Mar 8, 2024

ADTs are reasonably implemented in master, with a proper syntax, pattern matching, destructuring, and typechecking. We can still discuss further improvements or challenge the current approach somewhere, but I would consider this as done.

@yannham yannham closed this as completed Mar 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested type: feature request
Projects
None yet
Development

No branches or pull requests

5 participants