Are you Polish? FSharp will tell us, probably - part 2

#fsharp #programming #dajsiepoznac #dsp #blog #polish

Forenotes from 2023

This post was originally published on my blog on March 12th, 2017 and was the 3nd big entry of my Daj Się Poznać 2017 blog post series 🖋️. This is the continuation of Are you Polish? FSharp will tell us, probably - part 1. Happy reading!

Are you Polish? F# will tell us, probably - part 2

In Are you Polish? FSharp will tell us, probably - part 1, we had our first glance at F# and how we would use it to answer the question that is on everyone's lips: is your surname Polish? In today's post (the 2nd and last one of this series) we will go over the rest of the script that you can find on this public gist.

We already went over the countCharCI function that returns the number of occurrences of a given character in a word. We can simply call the function like this:

countCharCI 'C' "ccc" // returns 3

What we'd like to do now is implement the concept of points, and more particularly this: Each Polish letter will earn the surname 1 point. For instance, If a word contains 3 polish letters, we'd like to assign 3 points to it.

Handling Polish letters

I introduced the new computePointsFor function to help us achieve this:

/// Apply a function to all elements of a list, sum the results and multiply by the number of points
let computePointsFor func elems points word =
  elems
  |> List.map (fun e -> func e word)
  |> List.reduce (+)
  |> (\*) points

The function returns an int and takes 4 parameters as input:

func is a function that takes two parameters of type 'a and 'b and returns an int.
elems is a list of 'a.
points is an int.
word is of type 'b.

You may wonder what are those 'a and 'b about. This is the F# way of representing generic types! In C#, We'd probably have T1 and T2 instead. As explained in the previous post, F# is able to infer all those types from their usage.

The second interesting thing here is, as I mentioned above, func is a function! As F# is a functional-first programming language, functions are treated as first-class citizens and you can pass them around as you normally would for any other parameter. This is of course possible in C# too, but it doesn't feel as natural and simple as in F#, especially when you throw generic types in the mix.

What does computePointsFor actually do? we already are familiar with the |> operator that we previously saw in the countCharCI function. We can see at a glance that the function is composed of 3 steps:

It applies the func function on every element of the elems list, using List.map. It also uses word as the second parameter of func. This steps returns a list of 'c.
It then sums together all elements from the previous steps, using the addition operator (+) in combination with List.reduce. List.reduce iterates over a list of elements, applying a function on them and storing the results in an accumulator that is passed to the next iteration, until it reaches the end of the list. This step returns an int.
The last step takes the int returned from step 2 and multiplies it by the points parameter. The end result is of course an int.

Now that we have the function implemented, let's test it:

let polishChars = ['ą';'ć';'ę';'ł';'ń';'ó';'ś';'ż';'ź']

computePointsFor countCharCI polishChars 1 "Bob Johnson" // returns 0
computePointsFor countCharCI polishChars 1 "Robert Jonsłon" // returns 1
computePointsFor countCharCI polishChars 2 "Stanisław Wójcik" // returns 4
computePointsFor countCharCI polishChars 1 "" // returns 0

It seems to work as expected! Cool. As you can see, I passed countCharCI as the func parameter, the list of Polish characters polishChars as the elems parameter, 1 as the number of points, and the surname we want to test against as the last parameter.

So far so good, but what if we were a bit more demanding and tried to make our code more readable? Let's create the new checkPolishChars function that wraps nicely around countCharCI and polishChars to make our calls clear and succinct:

// Helper function with pre-baked parameters
let checkPolishChars points word =
  computePointsFor countCharCI polishChars points word

And then test it:

checkPolishChars 3 "Józef Gwóźdź" // returns 12

Perfect. It looks like we are done with handling Polish letters, so let's go over the digraphs next.

Handling Polish digraphs

Similar to the countCharCI function we saw above, I created countDigraphCI as shown here:

/// Return the number of occurrences of a given digraph within a word (case-insensitive)
/// The function ignores overlaps (countDigraphCI "cc" "cccc" returns 2 and not 3)
let countDigraphCI (digraph:string) (word:string) =
  let wordCI = word.ToLower()
  let digraphCI = digraph.ToLower()

  let rec loop occurrences index =
    if index >= String.length wordCI then occurrences
    else
      match wordCI.IndexOf(digraphCI, index) with
      | -1 -> occurrences // -1 means the substring was not found
      | indexFound -> loop (occurrences + 1) (indexFound + String.length digraphCI)

  if String.length word = 0 then 0
  else loop 0 0

It obviously takes 2 parameters as input, a digraph and a word, both of type string. There is something very interesting in this function, something that we haven't seen yet in the previous examples. It seems like there is... a function within the function?! That's correct, the loop function is actually defined inside countDigraphCI. Note that this functionality was recently added to C# 7.0, under the name of local functions.

The logic of loop is straightforward. It is decorated by the rec keyword, which means that the function is recursive. It looks for the first occurrence of the given digraph within word, by using the String.IndexOf function from the .NET framework. If the digraph was found, it increments the occurrences counter and recursively calls itself once again, this time starting at the new index. It continues to do so until it reaches the end of the string.

But Bryan, why not use a good ol' for loop instead? Why confuse us with this whole recursive wabble babble?

I hear you ask. This is because using a standard for loop would require creating a mutable variable to store the current state (the occurrences parameter here). And F#, being a functional-first language, tries to stay away from mutable stuff unless truly necessary. Let's see the new function in action:

countDigraphCI "cz" "Szczerba" // 1
countDigraphCI "cz" "szCZerba" // 1
countDigraphCI "sz" "Szczerba" // 1
countDigraphCI "sz" "" // 0
countDigraphCI "cz" "Wieczorkiewicz" // 2

It seems to be working fine, once again. Let's reuse our computePointsFor function and apply it here too:

// A digraph is a combination of letters that represent a single sound!
let polishDigraphs = ["ch";"cz";"dz";"dż";"dź";"rz";"sz"]
  
/// Apply computePointsFor to our countDigraphCI function
let checkPolishDigraphs points word =
  computePointsFor countDigraphCI polishDigraphs points word

checkPolishDigraphs 1 "szczrz" // 3
checkPolishDigraphs 2 "Dziurdź" // 4
checkPolishDigraphs 3 "Szczerba" // 6
checkPolishDigraphs 1 "Błaszczyszyn" // 3

Boom! I hope you can see how convenient it is to be able to pass two different functions with different signatures to computePointsFor and let F# deal with all the types automatically.

Time to take care of our last problem, the Polish endings.

Handling Polish endings

The finishWithCI function is straightforward and makes use of the String.EndsWith function that is part of the .NET framework. Here is its implementation together with a series of tests:

/// Check whether a word has the given ending (case-insensitive)
let finishWithCI (suffix:string) (word:string)=
  word.ToLower().EndsWith(suffix.ToLower())

finishWithCI "ski" "kowalski" // true
finishWithCI "SKi" "kowalsKI" // true
finishWithCI "wicz" "Nowak" // false
finishWithCI "" "Nowak" // true
finishWithCI "" "" // true
finishWithCI "ski" "" // false

However, checking if a given surname has one of the possible endings we defined earlier is a bit different than counting the number of occurrences of a letter or a digraph. So I needed another function, checkConditions, defined below:

/// Apply a list of criteria on a subject and return as soon as one matches the given condition
let checkConditions criteria condition subject =
  let rec loop remainingElems = // sub-function, loop over the criteria
    match remainingElems with
    | [] -> (false, None)
    | c::r ->
      match condition c subject with
      | true -> (true, Some c) // exit as soon as we get a positive result
      | false -> loop r // else loop over the next element

  loop criteria

I won't go into too much details here because this post is getting long enough, but we can notice two things:

checkConditions also defines a local recursive function called loop.
The loop function makes extensive use of an F# feature called pattern matching. I will most probably write a separate post on this topic as this is a very powerful mechanism.

The function basically checks a condition against a list of criteria and returns as soon as one of the criteria fulfils the condition. It returns a tuple of the following form:

either (false, None) if NO criterion fulfilled the condition.
or (true, Some criterion) if a criterion fulfilled the condition.

We can see it in action below:

let polishEndings = ["wicz";"czyk";"wski";"wska";"ński";"ńska";"ski";...]

checkConditions polishEndings finishWithCI "Kowalski"
// returns (true, Some "ski")

That concludes the part about the Polish endings. Once again, let's wrap those functions in a convenient and clear helper:

/// Return a certain amount of points if the given word has one of the most common Polish endings
let checkPolishEndings points word =
  let success, _ = checkConditions polishEndings finishWithCI word
  if success then points else 0

We're almost there!

We are very close to getting our magical function that tells us if a surname is Polish or not. I created a new type that would represent our possible answers. This is called a discriminated union in F#:

type NameOrigin = DefinitelyPolish | ProbablyPolish | NotPolish

Let's throw 2 more functions in the mix. The first one is calculatePolishDensity and the comment below is self-explanatory. It also makes use of pattern matching:

/// Divide the number of points obtained for a given word by the length of the word itself
let calculatePolishDensity (word:string, points) =
  match word.Length with
  | 0 -> (word, 0.)
  | len -> (word, float points / float len)

The second one is decideIfPolish. It converts the Polish density parameter to NameOrigin based on the rules we defined in our previous post:

Any density below 0.2 would yield Not Polish.
Any density between 0.2 and 0.8 would yield Probably Polish.
Any density above 0.8 would yield Definitely Polish.

Once again, the implementation is straightforward:

/// Decide if a word is Polish based on its density
let decideIfPolish (word, density) =
  if (density < 0.2) then
    (word, NameOrigin.NotPolish, density)
  else if (density >= 0.2 && density <= 0.8) then
    (word, NameOrigin.ProbablyPolish, density)
  else
    (word, NameOrigin.DefinitelyPolish, density)

Let's prepare the field for our final function and create checkAllPolishConditions that will calculate the total number of points for a surname, based on Polish letters, digraphs and endings:

let checkAllPolishConditions pointsPerChar pointsPerEnding pointsPerDigraph =
  checkPolishCharsPipe pointsPerChar
  >> checkPolishEndingsPipe pointsPerEnding
  >> checkPolishDigraphsPipe pointsPerDigraph

The >> operator is called the composition operator in F#. It allows to create one big function out of several smaller functions with compatible signatures. We will surely go back to this topic in a future post.

And here it finally comes, the long-awaited isNamePolish function in all its splendor:

let isNamePolish name =
  (name, 0)
  |> checkAllPolishConditions 1 6 3
  |> calculatePolishDensity
  |> decideIfPolish

... which checks all conditions and assigns points, calculate the Polish density and then decides whether the name is Polish based on the score obtained! Unleash the beast!

isNamePolish "Młynarczyk" // returns ("Młynarczyk", DefinitelyPolish, 1.0)

Młynarczyk is Definitely Polish, the algorithm got it right. Next!

isNamePolish "Johnson" // returns ("Johnson", NotPolish, 0.0)

Johnson is Not Polish, once again the algorithm worked as expected! Let's try one more.

isNamePolish "Kowalski" // returns ("Kowalski", ProbablyPolish, 0.75)

Hmm, the algorithm says Probably Polish whereas the expected result would be Definitely Polish. We'd probably need to tweak it a bit, either by updating the points system or the decision based on the density.

Last one for the road, as requested by my colleague Marcin. Brzęczyszczykiewicz. This has got to be the most Polish, or polishest if you will, surname ever. and the result is...

isNamePolish "Brzęczyszczykiewicz"
// returns ("Brzęczyszczykiewicz", DefinitelyPolish, 1.157894737)

Definitely Polish!

Thank you F#, you are amazing!

This concludes our series about Polish surnames, hope you liked it!

Cheers!