Are you Polish? FSharp will tell us, probably - part 2
Forenotes from 2023
This post was originally published on my blog on March 12th, 2017 and was the 3nd big entry of my Daj Się Poznać 2017 blog post series 🖋️. This is the continuation of Are you Polish? FSharp will tell us, probably - part 1. Happy reading!
Are you Polish? F# will tell us, probably - part 2
In Are you Polish? FSharp will tell us, probably - part 1, we had our first glance at F# and how we would use it to answer the question that is on everyone's lips: is your surname Polish? In today's post (the 2nd and last one of this series) we will go over the rest of the script that you can find on this public gist.
We already went over the countCharCI
function that returns the number of occurrences of a given character in a word. We can simply call the function like this:
countCharCI 'C' "ccc" // returns 3
What we'd like to do now is implement the concept of points, and more particularly this: Each Polish letter will earn the surname 1 point. For instance, If a word contains 3 polish letters, we'd like to assign 3 points to it.
Handling Polish letters
I introduced the new computePointsFor
function to help us achieve this:
/// Apply a function to all elements of a list, sum the results and multiply by the number of points
let computePointsFor func elems points word =
elems
|> List.map (fun e -> func e word)
|> List.reduce (+)
|> (\*) points
The function returns an int
and takes 4 parameters as input:
func
is a function that takes two parameters of type'a
and'b
and returns anint
.elems
is a list of'a
.points
is anint
.word
is of type'b
.
You may wonder what are those 'a
and 'b
about. This is the F# way of representing generic types! In C#, We'd probably have T1
and T2
instead. As explained in the previous post, F# is able to infer all those types from their usage.
The second interesting thing here is, as I mentioned above, func
is a function! As F# is a functional-first programming language, functions are treated as first-class citizens and you can pass them around as you normally would for any other parameter. This is of course possible in C# too, but it doesn't feel as natural and simple as in F#, especially when you throw generic types in the mix.
What does computePointsFor
actually do? we already are familiar with the |>
operator that we previously saw in the countCharCI
function. We can see at a glance that the function is composed of 3 steps:
- It applies the
func
function on every element of theelems
list, usingList.map
. It also usesword
as the second parameter offunc
. This steps returns a list of'c
. - It then sums together all elements from the previous steps, using the addition operator
(+)
in combination withList.reduce
.List.reduce
iterates over a list of elements, applying a function on them and storing the results in an accumulator that is passed to the next iteration, until it reaches the end of the list. This step returns anint
. - The last step takes the
int
returned from step 2 and multiplies it by thepoints
parameter. The end result is of course anint
.
Now that we have the function implemented, let's test it:
let polishChars = ['ą';'ć';'ę';'ł';'ń';'ó';'ś';'ż';'ź']
computePointsFor countCharCI polishChars 1 "Bob Johnson" // returns 0
computePointsFor countCharCI polishChars 1 "Robert Jonsłon" // returns 1
computePointsFor countCharCI polishChars 2 "Stanisław Wójcik" // returns 4
computePointsFor countCharCI polishChars 1 "" // returns 0
It seems to work as expected! Cool. As you can see, I passed countCharCI
as the func
parameter, the list of Polish characters polishChars
as the elems
parameter, 1
as the number of points
, and the surname we want to test against as the last parameter.
So far so good, but what if we were a bit more demanding and tried to make our code more readable? Let's create the new checkPolishChars
function that wraps nicely around countCharCI
and polishChars
to make our calls clear and succinct:
// Helper function with pre-baked parameters
let checkPolishChars points word =
computePointsFor countCharCI polishChars points word
And then test it:
checkPolishChars 3 "Józef Gwóźdź" // returns 12
Perfect. It looks like we are done with handling Polish letters, so let's go over the digraphs next.
Handling Polish digraphs
Similar to the countCharCI
function we saw above, I created countDigraphCI
as shown here:
/// Return the number of occurrences of a given digraph within a word (case-insensitive)
/// The function ignores overlaps (countDigraphCI "cc" "cccc" returns 2 and not 3)
let countDigraphCI (digraph:string) (word:string) =
let wordCI = word.ToLower()
let digraphCI = digraph.ToLower()
let rec loop occurrences index =
if index >= String.length wordCI then occurrences
else
match wordCI.IndexOf(digraphCI, index) with
| -1 -> occurrences // -1 means the substring was not found
| indexFound -> loop (occurrences + 1) (indexFound + String.length digraphCI)
if String.length word = 0 then 0
else loop 0 0
It obviously takes 2 parameters as input, a digraph
and a word
, both of type string
. There is something very interesting in this function, something that we haven't seen yet in the previous examples. It seems like there is... a function within the function?! That's correct, the loop
function is actually defined inside countDigraphCI
. Note that this functionality was recently added to C# 7.0, under the name of local functions.
The logic of loop
is straightforward. It is decorated by the rec
keyword, which means that the function is recursive. It looks for the first occurrence of the given digraph
within word
, by using the String.IndexOf
function from the .NET framework. If the digraph was found, it increments the occurrences
counter and recursively calls itself once again, this time starting at the new index. It continues to do so until it reaches the end of the string.
But Bryan, why not use a good ol' for loop instead? Why confuse us with this whole recursive wabble babble?
I hear you ask. This is because using a standard for loop would require creating a mutable variable to store the current state (the occurrences
parameter here). And F#, being a functional-first language, tries to stay away from mutable stuff unless truly necessary. Let's see the new function in action:
countDigraphCI "cz" "Szczerba" // 1
countDigraphCI "cz" "szCZerba" // 1
countDigraphCI "sz" "Szczerba" // 1
countDigraphCI "sz" "" // 0
countDigraphCI "cz" "Wieczorkiewicz" // 2
It seems to be working fine, once again. Let's reuse our computePointsFor
function and apply it here too:
// A digraph is a combination of letters that represent a single sound!
let polishDigraphs = ["ch";"cz";"dz";"dż";"dź";"rz";"sz"]
/// Apply computePointsFor to our countDigraphCI function
let checkPolishDigraphs points word =
computePointsFor countDigraphCI polishDigraphs points word
checkPolishDigraphs 1 "szczrz" // 3
checkPolishDigraphs 2 "Dziurdź" // 4
checkPolishDigraphs 3 "Szczerba" // 6
checkPolishDigraphs 1 "Błaszczyszyn" // 3
Boom! I hope you can see how convenient it is to be able to pass two different functions with different signatures to computePointsFor
and let F# deal with all the types automatically.
Time to take care of our last problem, the Polish endings.
Handling Polish endings
The finishWithCI
function is straightforward and makes use of the String.EndsWith
function that is part of the .NET framework. Here is its implementation together with a series of tests:
/// Check whether a word has the given ending (case-insensitive)
let finishWithCI (suffix:string) (word:string)=
word.ToLower().EndsWith(suffix.ToLower())
finishWithCI "ski" "kowalski" // true
finishWithCI "SKi" "kowalsKI" // true
finishWithCI "wicz" "Nowak" // false
finishWithCI "" "Nowak" // true
finishWithCI "" "" // true
finishWithCI "ski" "" // false
However, checking if a given surname has one of the possible endings we defined earlier is a bit different than counting the number of occurrences of a letter or a digraph. So I needed another function, checkConditions
, defined below:
/// Apply a list of criteria on a subject and return as soon as one matches the given condition
let checkConditions criteria condition subject =
let rec loop remainingElems = // sub-function, loop over the criteria
match remainingElems with
| [] -> (false, None)
| c::r ->
match condition c subject with
| true -> (true, Some c) // exit as soon as we get a positive result
| false -> loop r // else loop over the next element
loop criteria
I won't go into too much details here because this post is getting long enough, but we can notice two things:
checkConditions
also defines a local recursive function calledloop
.- The loop function makes extensive use of an F# feature called pattern matching. I will most probably write a separate post on this topic as this is a very powerful mechanism.
The function basically checks a condition
against a list of criteria
and returns as soon as one of the criteria fulfils the condition. It returns a tuple of the following form:
- either
(false, None)
if NO criterion fulfilled the condition. - or
(true, Some criterion)
if a criterion fulfilled the condition.
We can see it in action below:
let polishEndings = ["wicz";"czyk";"wski";"wska";"ński";"ńska";"ski";...]
checkConditions polishEndings finishWithCI "Kowalski"
// returns (true, Some "ski")
That concludes the part about the Polish endings. Once again, let's wrap those functions in a convenient and clear helper:
/// Return a certain amount of points if the given word has one of the most common Polish endings
let checkPolishEndings points word =
let success, _ = checkConditions polishEndings finishWithCI word
if success then points else 0
We're almost there!
We are very close to getting our magical function that tells us if a surname is Polish or not. I created a new type that would represent our possible answers. This is called a discriminated union in F#:
type NameOrigin = DefinitelyPolish | ProbablyPolish | NotPolish
Let's throw 2 more functions in the mix. The first one is calculatePolishDensity
and the comment below is self-explanatory. It also makes use of pattern matching:
/// Divide the number of points obtained for a given word by the length of the word itself
let calculatePolishDensity (word:string, points) =
match word.Length with
| 0 -> (word, 0.)
| len -> (word, float points / float len)
The second one is decideIfPolish
. It converts the Polish density parameter to NameOrigin
based on the rules we defined in our previous post:
- Any density below 0.2 would yield Not Polish.
- Any density between 0.2 and 0.8 would yield Probably Polish.
- Any density above 0.8 would yield Definitely Polish.
Once again, the implementation is straightforward:
/// Decide if a word is Polish based on its density
let decideIfPolish (word, density) =
if (density < 0.2) then
(word, NameOrigin.NotPolish, density)
else if (density >= 0.2 && density <= 0.8) then
(word, NameOrigin.ProbablyPolish, density)
else
(word, NameOrigin.DefinitelyPolish, density)
Let's prepare the field for our final function and create checkAllPolishConditions
that will calculate the total number of points for a surname, based on Polish letters, digraphs and endings:
let checkAllPolishConditions pointsPerChar pointsPerEnding pointsPerDigraph =
checkPolishCharsPipe pointsPerChar
>> checkPolishEndingsPipe pointsPerEnding
>> checkPolishDigraphsPipe pointsPerDigraph
The >>
operator is called the composition operator in F#. It allows to create one big function out of several smaller functions with compatible signatures. We will surely go back to this topic in a future post.
And here it finally comes, the long-awaited isNamePolish
function in all its splendor:
let isNamePolish name =
(name, 0)
|> checkAllPolishConditions 1 6 3
|> calculatePolishDensity
|> decideIfPolish
... which checks all conditions and assigns points, calculate the Polish density and then decides whether the name is Polish based on the score obtained! Unleash the beast!
isNamePolish "Młynarczyk" // returns ("Młynarczyk", DefinitelyPolish, 1.0)
Młynarczyk is Definitely Polish, the algorithm got it right. Next!
isNamePolish "Johnson" // returns ("Johnson", NotPolish, 0.0)
Johnson is Not Polish, once again the algorithm worked as expected! Let's try one more.
isNamePolish "Kowalski" // returns ("Kowalski", ProbablyPolish, 0.75)
Hmm, the algorithm says Probably Polish whereas the expected result would be Definitely Polish. We'd probably need to tweak it a bit, either by updating the points system or the decision based on the density.
Last one for the road, as requested by my colleague Marcin. Brzęczyszczykiewicz. This has got to be the most Polish, or polishest if you will, surname ever. and the result is...
isNamePolish "Brzęczyszczykiewicz"
// returns ("Brzęczyszczykiewicz", DefinitelyPolish, 1.157894737)
Definitely Polish!
Thank you F#, you are amazing!
This concludes our series about Polish surnames, hope you liked it!
Cheers!