Suppose you have a Person
class in your program, and that a Person
has an age
. What type should the age
be?
age
as a String
case class Person(age: String)
"Of course it shouldn't be a String
" you might think. But why? The reason is that we can then end up with code like
val person = Person("Jeff")
If we ever wanted to do anything with an age: String
, we would need to validate it everywhere
def isOldEnoughToSmoke(person: Person): Boolean = {
Try(person.age.toInt) match {
case Failure(_) => throw new Exception(s"cannot parse age '${person.age}' as numeric")
case Success(value) => value >= 18
}
}
def isOldEnoughToDrink(person: Person): Boolean = {
Try(person.age.toInt) match {
case Failure(_) => throw new Exception(s"cannot parse age '${person.age}' as numeric")
case Success(value) => value >= 21
}
}
// etc.
This is cumbersome for the programmer writing the code, and makes it difficult for any programmer reading the code, as well.
We could move this validation to a separate method
def parseAge(age: String): Int = {
Try(age.toInt) match {
case Failure(_) => throw new Exception(s"cannot parse age '$age' as numeric")
case Success(value) => value
}
}
def isOldEnoughToSmoke(person: Person): Boolean =
parseAge(person.age) >= 18
def isOldEnoughToDrink(person: Person): Boolean =
parseAge(person.age) >= 21
...but this is still not ideal. The code is a bit cleaner, but we still need to parse a String
into an Int
every time we want to do anything numeric (comparison, arithmetic, etc.) with the age
. This is often called "stringly-typed" data.
This can also move the program into an illegal state by throwing an Exception
. If we're going to fail anyway, we should fail fast. We can do better.
age
as an Int
Your first instinct might have been to make age
an Int
, rather than a String
case class Person(age: Int)
If so, you have good instincts. An age: Int
is much nicer to work with
def isOldEnoughToSmoke(person: Person): Boolean =
person.age >= 18
def isOldEnoughToDrink(person: Person): Boolean =
person.age >= 21
This
- is easier to write
- is easier to read
- fails fast
You cannot construct an instance of the Person
class with a String
age
now. That is an invalid state. We have made it unrepresentable, using the type system. The compiler will not allow this program to compile.
Problem solved, right?
val person = Person(-1)
This is clearly an invalid state as well. A person cannot have a negative age.
val person = Person(90210)
This is also invalid -- it looks like someone accidentally entered their ZIP code instead of their age.
So how can we constrain this type even further? How can we make even more invalid states unrepresentable?
age
as an Int
with constraints
at runtime
We can enforce runtime constraints in any statically-typed language
case class Person(age: Int) {
assert(age >= 0 && age < 150)
}
In Scala, assert
will throw a java.lang.AssertionError
if the assertion fails.
Now we can be sure that the age
for any Person
will always be within the range [0, 150)
. Both
val person = Person(-1)
and
val person = Person(90210)
will now fail. But they will fail at runtime, halting the execution of our program.
This is similar to what we saw in "age
as a String
", above. This is still not ideal. Is there a better way?
at compile time
Many languages allow compile-time constraints, as well. Usually this is accomplished through macros, which inspect the source code during a compilation phase. These are often referred to as refined types.
Scala has quite good support for refined types across multiple libraries. A solution using the refined
library might look something like
case class Person(age: Int Refined GreaterEqual[0] And Less[150])
A limitation of this approach is that the field(s) to be constrained at compile-time must be literal, hard-coded values. Compile-time constraints cannot be enforced on, for example, values provided by a user. By that point, the program has already been compiled. In this case, we can always fall back to runtime constraints, which is often what these libraries do.
For now, we'll continue with runtime constraints only, since often that's the best we can do.
age
as an Age
with constraints
From simplest to most complex implementation, we moved left to right in the diagram below
String => Int => Int with constraints
This increase in complexity directly correlates with the accuracy with which we're modelling this data
"The problems tackled have inherent complexity, and it takes some effort to model them appropriately." [source]
The move left-to-right above should be driven by the requirements of your system. You should not implement compile-time refinements, for example, unless you have lots of hard-coded values to validate at compile time: otherwise you aren't gonna need it. Every line of code has a cost to implement and maintain. Avoiding premature specification is just as important as avoiding premature generalization, though it's always easier to move from more specific types to less specific types, so prefer specificity over generalization.
Every bit of data has a context, as well. There is no such thing as a "pure" Int
value floating around in the universe. An age can be modelled as an Int
, but it's different from a weight, which could also be modelled as an Int. The labels we attach to these raw values are the context
case class Person(age: Int, weight: Int) {
assert(age >= 0 && age < 150)
assert(weight >= 0 && weight < 500)
}
There is one more problem for us to solve here. Suppose I'm 81kg and 33 years old
val me = Person(81, 33)
That compiles, but... it shouldn't. I swapped my weight and age!
An easy way to avoid this confusion is to define some more types. In this case, newtypes
case class Age(years: Int) {
assert(years >= 0 && years < 150)
}
case class Weight(kgs: Int) {
assert(kgs >= 0 && kgs < 500)
}
case class Person(age: Age, weight: Weight)
The name newtype for this pattern comes from Haskell. This is a simple way to ensure that we don't accidentally swap values with the same underlying type. The following, for example, will not compile
val age = Age(33)
val weight = Weight(81)
val me = Person(weight, age) // does not compile!
We could also use tagged types. In Scala, the simplest possible example of this looks something like
trait AgeTag
type Age = Int with AgeTag
object Age {
def apply(years: Int): Age = {
assert(years >= 0 && years < 150)
years.asInstanceOf[Age]
}
}
trait WeightTag
type Weight = Int with WeightTag
object Weight {
def apply(kgs: Int): Weight = {
assert(kgs >= 0 && kgs < 500)
kgs.asInstanceOf[Weight]
}
}
case class Person(age: Age, weight: Weight)
val p0 = Person(42, 42) // does not compile -- an Int is not an Age
val p1 = Person(Age(42), 42) // does not compile -- an Int is not a Weight
val p2 = Person(Age(42), Weight(42)) // compiles!
val p3 = Person(Weight(42), Weight(42)) // does not compile -- a Weight is not an Age
This makes use of the fact that function application f()
is syntactic sugar in Scala for an apply()
method. So f()
is equivalent to f.apply()
.
This approach allows us to model the idea that an Age
/ a Weight
is an Int
, but an Int
is not an Age
/ a Weight
. This means we can treat an Age
/ a Weight
as an Int
and add, subtract, or do whatever other Int
-like things we want to do.
Mixing these two approaches in one example, you can see the difference between newtypes and tagged types. You must extract the "raw value" from a newtype. You do not need to do this with a tagged type
// `Age` is a tagged type
trait AgeTag
type Age = Int with AgeTag
object Age {
def apply(years: Int): Age = {
assert(years >= 0 && years < 150)
years.asInstanceOf[Age]
}
}
// `Weight` is a newtype
case class Weight(kgs: Int) {
assert(kgs >= 0 && kgs < 500)
}
// `Age`s can be treated as `Int`s, because they _are_ `Int`s
assert(40 == Age(10) + Age(30))
// `Weight`s are not `Int`s, they _contain_ `Int`s
Weight(10) + Weight(30) // does not compile
// To add `Weight`s, we must "unwrap" them
Weight(10).kgs + Weight(30).kgs
In some languages, the "unwrapping" of newtypes can be done automatically. This can make newtypes as ergonomic as tagged types. For example, in Scala, this could be done with an implicit conversion
implicit def weightAsInt(weight: Weight): Int = weight.kgs
// `Weight`s are not `Int`s, but they can be _converted_ to `Int`s
Weight(10) + Weight(30) // this now compiles
Further refinements
The important point of the above discussion is that, as much as possible, we want to make invalid states unrepresentable.
- "Jeff" is an invalid age. Age isn't a string, it is a number.
- -1 is an invalid age. Age cannot be negative, it should be 0 or positive, and probably less than about 150.
- My age is not 88. An age should be easily distinguishable from other integral values, like weight.
Everything discussed above implemented these refinements on the concept of "age", one at a time.
We can make further refinements if there is a need for those refinements.
For example, suppose we want to send a "Happy Birthday!" email to a Person
on their birthday. Rather than an Age
, we now need a date of birth.
case class Date(year: Year, month: Month, day: Day)
case class Year(value: Int, currentYear: Int) {
assert(value >= 1900 && value <= currentYear)
}
case class Month(value: Int) {
assert(value >= 1 && value <= 12)
}
case class Day(value: Int) {
assert(value >= 1 && value <= 31)
}
case class Person(dateOfBirth: Date, weight: Weight) {
def age(currentDate: Date): Age = {
??? // TODO calculate Age from dateOfBirth
}
}
The amount of information provided by dateOfBirth
is strictly greater than the amount of information provided by Age
. We can calculate someone's age from their date of birth, but we cannot do the opposite.
The above implementation leaves much to be desired, though -- there are lots of invalid states. A better way to implement this would be for Month
to be an enum, and for Day
validity to depend on the Month
(February never has 30 days, for example)
case class Year(value: Int, currentYear: Int) {
assert(value >= 1900 && value <= currentYear)
}
sealed trait Month
case object January extends Month
case object February extends Month
case object March extends Month
case object April extends Month
case object May extends Month
case object June extends Month
case object July extends Month
case object August extends Month
case object September extends Month
case object October extends Month
case object November extends Month
case object December extends Month
case class Day(value: Int, month: Month) {
month match {
case February => assert(value >= 1 && value <= 28)
case April | June | September | November => assert(value >= 1 && value <= 30)
case _ => assert(value >= 1 && value <= 31)
}
}
case class Date(year: Year, month: Month, day: Day)
Always prefer low-cardinality types to high-cardinality types, when possible. It limits the number of possible invalid states. In most languages, enum
s are the way to go here (in Scala 2, an enum
can be modelled using a sealed trait
, as shown above). But there are still invalid states hiding above. Can you find them?
In some cases, stringly-typed data validated using regular expressions can be replaced entirely by enum
s. Could you model Canadian postal codes such that it's impossible to construct an invalid one?
Use the above knowledge to go forth and make invalid states unrepresentable.