Are you Polish? FSharp will tell us, probably - part 1

#fsharp #programming #dajsiepoznac #dsp #blog #polish

Forenotes from 2023

This post was originally published on my blog on March 11th, 2017 and was the 2nd big entry of my Daj Się Poznać 2017 blog post series 🖋️. Back then I was primarily a C# developer and I had been reading a lot about this other, less known .NET language called F#. The language, and the way it forced me to relearn how I think about programming, quickly had me fascinated. It seemed like sharing my findings about F# with others would be a great way to get more proficient with it, and perhaps spread around my newly found enthusiasm :). A few years later, I can definitely say that F# had a deep influence on my career. Let's dive into it 🙂.

Are you Polish? F# will tell us, probably - part 1

In my previous post I created a small Hello, World! web application using WebSharper. As F# is predominant in WebSharper, I figured I'd write a few posts about the language itself. I didn't want to write yet another F# tutorial focused on theory (there are plenty of those out in the Wild), so I came up with a simple use case I could use as material for those posts. How about we wrote a small script that would tell the probability a given surname is Polish?

This would serve as:

a smooth and practical introduction to F# for those who don't know it yet, with just a pinch of functional programming theory (nothing too complicated, I promise).
a short refresher for the others (myself included).
a unique opportunity to learn a little more about Polish surnames!

Let's start then. First, let's describe the problem at hand and what we want to achieve. Given an arbitrary surname, we want to know if it is either:

Definitely Polish
Probably Polish
Not Polish

The magical algorithm

In order to do so, I came up with a rather naive algorithm that would help us get the desired results. This algorithm is based on 3 rules:

How many exclusive Polish letters does the surname contain? The Polish language has a set of of 9 letters based on the Latin alphabet, but combined with diacritics. Here is the complete list: ą, ć, ę, ł, ń, ó,'ś, ż, ź
How many Polish digraphs does the surname contain? Put simply, a digraph is a pair of letters that, combined together, form a new letter with a sound on its own. The Polish language contains 7 digraphs: ch, cz, dz, dż, dź, rz, sz
Does the surname end with a common Polish ending? After analysing a few dozens of popular Polish surnames, I put together a list of the most common endings: wicz, czyk, wski, wska, ński, ńska, ski, ska, cki, cka, ło, ła, ak, rz You will notice that some of them are quite similar, like wski / wska or cki / cka. The forms ending by i usually mean the bearer is a man while the ones ending by a mean the bearer is a woman. Easy!

The algorithm would be a bit too minimalist if we stopped here, so I decided to introduce the concept of points to it:

Each Polish letter will earn the surname 1 point.
Each Polish digraph will earn the surname 3 points.
A Polish ending will earn the surname 6 points!

And last but not least, by dividing the total of points by the number of letter in the surname, we would obtain its Polish density, which in turn would give us our final answer:

Any density below 0.2 would yield Not Polish.
Any density between 0.2 and 0.8 would yield Probably Polish.
Any density above 0.8 would yield Definitely Polish.

Let's apply this to a sample surname: Młynarz

it would earn 10 points (1 for the ł letter, 3 for the he rz digraph, and 6 for the rz ending!)
the density would be 10 points divided by 7 letters = 1.43
the expected end result would be: Definitely Polish

Voilà! As you're probably blown away by the complexity of the algorithm, let's take a minute to catch our breath.

The magical code

Without further ceremony, let's go over the code that you can find on this public gist. Here are a few figures about it. The single file contains:

167 lines in total
of which 60 are either blank lines or comments (35.9%).
of which 39 are evaluations or tests (23.4%).
of which 68 are actual logic (40.7%).
3 declarations containing the Polish letters, digraphs and endings described above.
The whooping total of 16 function definitions! That's an average of 4 lines per function.

The file is an F# script (with the .fsx extension), which means you can run it dynamically within the F# Interactive console, by highlighting part of the code, right-clicking and choosing the Execute in Interactive option within Visual Studio:

This is a wonderful feature that allows for very fast development cycles: you write a new function, test it as you go and copy it over to your production code once you are pleased with the results.

Let's have a look at the very first function of the script, countCharCI. The function returns the number of occurrences of a character within a given word, and it obviously takes 2 parameters as input: char and word.

/// Return the number of occurrences of a given char within a word (case-insensitive)
let countCharCI char word =
  word
  |> Seq.filter (fun current -> Char.ToLower current = Char.ToLower char)
  |> Seq.length

Y'all get any more of them types?

I hear you say already:

"But Jimmy, isn't F# supposed to be a strongly-typed language? Can't see any of those type declarations in here."

Thanks for asking. This is because the F# compiler features a powerful type inference system and is able to guess the type of your parameters from their usage (most of the time). Even if you don't directly see them in the code, the types are really there though and will be enforced by the compiler and run-time. This prevents a lot of funky issues you could get with weakly-typed languages (Yes JavaScript, I am looking at you!). You can check this by hovering over the parameters' names in Visual Studio.

The char parameter is of type char:

The word parameter is of type seq<char> (a sequence of char, which a string is):

And finally the whole function has the following signature:
_char:char -> word:seq<char> -> int

Which basically means that it takes 2 parameters as input (the char and the seq<char>) and returns an int. Perfect! Sometimes though, the F# compiler will need our help figuring out the type of certain parameters. This is for instance the case for our one-liner finishWithCI function at line 92:

/// Check whether a word has the given ending (case-insensitive)
let finishWithCI (suffix:string) (word:string) =
  word.ToLower().EndsWith(suffix.ToLower())

Here we need to explicitly state the type of both suffix and word parameters in order to use the ToLower and EndsWith functions on them. This lets the compiler know that those two functions indeed belong to the string type. There are currently 32 parameters defined in the whole script, but only 5 of them actually are explicitly annotated. I love it!

As this post is getting pretty long (once again!), let's end by going over the logic of countCharCI:

/// Return the number of occurrences of a given char within a word (case-insensitive)
let countCharCI char word =
  word
  |> Seq.filter (fun current -> Char.ToLower current = Char.ToLower char)
  |> Seq.length

The function makes use of the Forward Pipe operator |> twice. This operator is omnipresent in F# and we will dedicated a whole blog post to it soon. Simply put, what it does is pass the value on the left side (word here) as the last parameter of the function on the right side (Seq.filter).

The Seq.filter function applies its 1st parameter, a predicate function, to each element of the sequence passed as its 2nd parameter. In our particular case, it will check if characters contained in word match the reference character passed to countCharCI. It will make the comparison case-insensitive by calling Char.ToLower on each side of the equality operator first:

fun current -> Char.ToLower current = Char.ToLower char

The fun keyword above allows to define a lambda expression, the same way as you would in C#. Please note that C# uses a fat arrow => while while F# uses a slim arrow ->. The Seq.filter itself returns a new sequence of characters that matched the predicate above. This sequence is then piped to Seq.length that simply returns its total length (thus the number of occurrences of char within word).

We're almost there. Let's test our shiny new function with the following evaluations:

countCharCI 'C' "ccc" // returns 3
countCharCI 'c' "cCc" // returns 3
countCharCI 'c' "" // returns 0
countCharCI '2' "145" // returns 0
countCharCI '3' "3f3sdf" // returns 2

It works! Let's call it a day! I will follow up with a second post (Are you Polish? FSharp will tell us, probably - part 2) where we will see the whole script in action, and finally get to answer the crucial question: Is your surname Polish?

More soon, take care!