Solve Sudokus with R and PicoSAT

2017-07-07

Recently I wrote a small package to use the PicoSAT solver in R. PicoSAT is a C program written by Armin Biere that can solve the so called SAT problem. Given a boolean formula, such as A OR B AND NOT C, can we set A, B and C to TRUE or FALSE such that the formula evaluates to TRUE?

It turns out many problems (such as a Sudoku) can be formulated as a SAT problem.

Bindings for PicoSAT exist for various languages but not for R. Also, it seems there is currently no package for solving SAT problems on CRAN.

What and why??

Boolean logic is everywhere. Whenever you filter something with dplyr you construct a boolean formula that evaluates to TRUE for the elements you would like to select.

To quote Wikipedia:

A propositional logic formula, also called Boolean expression, is built from variables, operators AND (conjunction, also denoted by ∧), OR (disjunction, ∨), NOT (negation, ¬), and parentheses. A formula is said to be satisfiable if it can be made TRUE by assigning appropriate logical values (i.e. TRUE, FALSE) to its variables.

The Boolean satisfiability problem (SAT) is, given a formula, to check whether it is satisfiable. This decision problem is of central importance in various areas of computer science, including theoretical computer science, complexity theory, algorithmics, cryptography and artificial intelligence.

As an example, consider the following formula:

\[ (A \Rightarrow B) \wedge (B \Rightarrow C) \wedge (C \Rightarrow A) \]

This basically means that we have three rules, if A is TRUE, then B is TRUE, if B is TRUE then C is TRUE and if C is TRUE, A is TRUE.

This can be formulated as a CNF (conjunctive normal form). Consider a CNF a standard format in which you can write boolean formulas. A CNF consists of clauses and literals. Each clause must hold and is hence concatinated by ANDs (\(\wedge\))). Within each clause, at least one of the literals must be TRUE (hence the concatination by OR (\(\vee\))).

\[ (\neg A \vee B) \wedge (\neg B \vee C) \wedge (\neg C \vee A) \]

One easy way to find all combinations of A, B and C for which the above formula is satisfiable is to simply evaluate all combinations. This can be done in base R fairly easily with expand.grid:

tf <- c(TRUE,FALSE)
combinations <- expand.grid(A = tf, B = tf, C = tf)
combinations$formula <- with(combinations, (!A | B) & (!B | C) & (!C | A))
combinations[combinations$formula == TRUE, ]
##       A     B     C formula
## 1  TRUE  TRUE  TRUE    TRUE
## 8 FALSE FALSE FALSE    TRUE

We found two assignments for which the above formula yields TRUE. We can conclude that it is satisfiable.

Enumerating all possible combinations might be feasible for a few variables, but not for thousands as the search space grows exponentially. And this is where specialized SAT solvers come into play.

The rpicosat package

rpicosat aims to be a lightweight wrapper around the original PicoSAT library.

The package can be installed from Github:

devtools::install_github("dirkschumacher/rpicosat")

In rpicosat the problem is encoded as a list of integer vectors. One vector for each clause.

The expression from before is encoded in the following way where integers denote the variables. Negative integers are negated variables.

formula <- list(
  c(-1, 2), # not 1 OR 2
  c(-2, 3), # not 2 OR 3
  c(-3, 1) # not 3 OR 1
)

The essential function in the package is picosat_sat that finds a solution to the SAT problem.

library(rpicosat)
res <- picosat_sat(formula)
res
## Variables: 3
## Clauses: 3
## Solver status: PICOSAT_SATISFIABLE

Every solution is also a data.frame so it can easily be used with other packages (e.g. dplyr). The data frame has two columns, variable and value.

as.data.frame(res)
##   variable value
## 1        1 FALSE
## 2        2 FALSE
## 3        3 FALSE

One last feature we need for a Sudoku solver is what PicoSAT calls assumptions. These are preset values for literals.

For example, if we assume that 1 is TRUE, then the valid solution is this:

as.data.frame(picosat_sat(formula, assumptions = c(1L)))
##   variable value
## 1        1  TRUE
## 2        2  TRUE
## 3        3  TRUE

Similarly, if we assume that 1 is TRUE, but 2 is FALSE, the formula is unsatisfiable.

picosat_sat(formula, assumptions = c(1L, -2L))
## Solver status: PICOSAT_UNSATISFIABLE

At this point, the maintainers of the package BoolNet deserve an hounarable mention, as they contriuted a patch to make the PicoSAT C code fully compatible with R.

A Sudoku solver

To showcase how a SAT solver can be useful we solve a Sudoku. This problem is particular interesting as I have written before about solving it using integer programming before.

In order to solve a Sudoku we need to find a way to encode the Sudoku problem as a boolean expression that is satisfiable if and only if a solution to a Sudoku exists. Luckily Tjark Weber has written an article describing a strategy I used for this article. Also the author of the python bindings for PicoSAT used Sudokus as an example as well.

The main idea is to create a variable \(x_{i,j,v}\) for each cell and number which is TRUE iff the respective number is in that cell. For example, \(x_{1,1,1} = TRUE\) means, that the number 1 is in the top left cell. Based on this we will create 5 groups of clauses that define a valid Sudoku. But first, let’s prepare some helper data:

library(dplyr)
library(purrr)
library(tidyr)

In order to easily select certain subsets of the variables we create a data frame with one row per variable and a column per index \(i,j,v\). In addition, we add the \(3 \times 3\) block where the variable belongs to. Each variable furthermore has a value index which is its unique name; it translates directly to the variables passed to rpicosat.

grid_helper <- expand.grid(i = 1:9, j = 1:9, v = 1:9) %>% 
  arrange(i, j, v) %>% 
  mutate(index = row_number()) %>% 
  mutate(block = 1 + floor((i - 1) / 3) * 3 + floor((j - 1) / 3))
head(grid_helper)
##   i j v index block
## 1 1 1 1     1     1
## 2 1 1 2     2     1
## 3 1 1 3     3     1
## 4 1 1 4     4     1
## 5 1 1 5     5     1
## 6 1 1 6     6     1

If you look at the first line it reads, variable with index (name) 1 is in row 1, column 1 and block 1.

In total we have this many variables:

nrow(grid_helper)
## [1] 729

At least on number per cell

For all \(i,j\) we build a clause \(\bigvee \limits_{v=1}^{9} x_{i,j,v}\). So in order for this clause to evaluate to TRUE, one of its variables needs to be TRUE and thus at least one number is present per cell.

# we nest the data per cell i,j
per_cell <- nest(grid_helper, v, index)$data

# then the clauses are straight forward.
at_least_one <- map(per_cell, ~.x$index)
head(at_least_one, 3)
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9
## 
## [[2]]
## [1] 10 11 12 13 14 15 16 17 18
## 
## [[3]]
## [1] 19 20 21 22 23 24 25 26 27

In total we have 81 such clauses.

At most one number per cell

In order to have exactly one number per cell, we create a clause for each cell and all combinations of two numbers in that cell. This clause forbids both numbers to be present at the same time.

So for each cell \(i,j\) and each pair of values \(1\leq v_1 < v_2 \leq 9\): \(\neg(x_{i,j,v_1} \wedge x_{i,j,v_2})\).

The code looks a bit strange, but just goes through our data.frame and selects the correct pairs of variables.

at_most_one  <- flatten(map(per_cell, function(ij_values) {
  flatten(map(1:8, function(v1) {
    map((v1 + 1):9, function(v2) {
      -1 * c(
        ij_values[ij_values$v == v1, ]$index,
        ij_values[ij_values$v == v2, ]$index)
    })
  }))
}))
head(at_most_one, 3)
## [[1]]
## [1] -1 -2
## 
## [[2]]
## [1] -1 -3
## 
## [[3]]
## [1] -1 -4

This gives us 2916 clauses.

No two equal numbers per row

Again for each row \(i\) and each value \(v\) we find all tuples for columns and negate them: \(\forall1\leq j_1 < j_2 \leq 9\): \(\neg(x_{i,j_1,v} \wedge x_{i,j_2,v})\)

no_two_per_line <- flatten(map(1:9, function(i) {
  line_df <- grid_helper[grid_helper$i == i, ]
  flatten(map(1:9, function(v) {
    line_df_v <- line_df[line_df$v == v, ]
    flatten(map(1:8, function(j1) {
      map((j1 + 1):9, function(j2) {
          -1 * c(
            line_df_v[line_df_v$j == j1, ]$index,
            line_df_v[line_df_v$j == j2, ]$index
          )
      })
    }))  
  }))
}))
head(no_two_per_line, 3)
## [[1]]
## [1]  -1 -10
## 
## [[2]]
## [1]  -1 -19
## 
## [[3]]
## [1]  -1 -28

In total we have 2916 clauses.

No two equal numbers per column

The same can be done for each column:

no_two_per_col <- flatten(map(1:9, function(j) {
  col_df <- grid_helper[grid_helper$j == j, ]
  flatten(map(1:9, function(v) {
    col_df_v <- col_df[col_df$v == v, ]
    flatten(map(1:8, function(i1) {
      map((i1 + 1):9, function(i2) {
          -1 * c(
            col_df_v[col_df_v$i == i1, ]$index,
            col_df_v[col_df_v$i == i2, ]$index
          )
      })
    }))  
  }))
}))
head(no_two_per_col, 3)
## [[1]]
## [1]  -1 -82
## 
## [[2]]
## [1]   -1 -163
## 
## [[3]]
## [1]   -1 -244

In total we have 2916 clauses.

No two equal numbers per block

The last constraint forbids the same number being in a block twice. Again we loop through each block, value and pairs of variables. Here the block column in the grid_helper comes in handy.

no_two_per_block <- flatten(map(1:9, function(b) {
  line_df <- grid_helper[grid_helper$block == b, ]
  flatten(map(1:9, function(v) {
    indexes <- line_df[line_df$v == v, ]$index
    flatten(map(1:8, function(i1) {
      map((i1 + 1):9, function(i2) {
          -1 * c(
            indexes[i1],
            indexes[i2]
          )
      })
    }))  
  }))
}))
head(no_two_per_block, 3)
## [[1]]
## [1]  -1 -10
## 
## [[2]]
## [1]  -1 -19
## 
## [[3]]
## [1]  -1 -82

Again 2916 clauses were added.

Solve it

Having prepared all the clauses, we can combine them to one giant CNF.

formula <- c(
  at_least_one,
  at_most_one,
  no_two_per_line,
  no_two_per_col,
  no_two_per_block
)
length(formula)
## [1] 11745

Even though we create a fairly large number of clauses, PicoSAT is able to find a solution in less than a second.

This CNF can then be passed to PicoSAT:

library(rpicosat)
result <- picosat_sat(formula)
result
## Variables: 729
## Clauses: 11745
## Solver status: PICOSAT_SATISFIABLE

The result is also a data.frame and we use tidyr to spread the \(j\) as columns. This then roughly looks like a Sudoku ;)

format_result <- function(x) {
  df <- x %>%
    right_join(grid_helper, by = c("variable" = "index")) %>% 
    filter(value == TRUE) %>% 
    select(i, j, v) %>% 
    spread(j, v)
  select(df, -i) %>% 
    knitr::kable(row.names = 1:9)
}
format_result(result)
1 2 3 4 5 6 7 8 9
1 9 8 7 6 5 4 3 2 1
2 6 5 4 3 2 1 9 7 8
3 3 2 1 9 8 7 6 5 4
4 8 9 6 7 4 5 2 1 3
5 7 4 5 2 1 3 8 9 6
6 2 1 3 8 9 6 7 4 5
7 5 7 9 4 6 8 1 3 2
8 4 6 2 1 3 9 5 8 7
9 1 3 8 5 7 2 4 6 9

Of course a Sudoku is only fun if some numbers are set beforehand. We can fix this by setting the respective variables to TRUE. This can be done by passing an integer vector to picosat_sat. Within picosat these are called assumptions.

So let us assume the first row is fixed ascending from 1 to 9 and the last col is set to c(9, 1:8).

fixed_vars1 <- filter(grid_helper, i == 1, j == v)$index
fixed_vars2 <- filter(grid_helper, j == 9, i == v + 1, i > 1)$index
result <- picosat_sat(formula, assumptions = c(fixed_vars1, fixed_vars2))

With these assumed starting values, the resulting Sudoku looks like this:

format_result(result)
1 2 3 4 5 6 7 8 9
1 1 2 3 4 5 6 7 8 9
2 9 8 6 2 7 3 5 4 1
3 7 5 4 9 8 1 6 3 2
4 8 9 7 6 4 5 2 1 3
5 6 3 5 8 1 2 9 7 4
6 4 1 2 7 3 9 8 6 5
7 5 7 8 3 9 4 1 2 6
8 3 6 9 1 2 8 4 5 7
9 2 4 1 5 6 7 3 9 8

Finding all possible solutions

Now the last question one could ask is: is this solution unique? To check this we can enumerate all possible solutions to a Sudoku with these assumptions. In order to do that, whenever we find a solution, we build another formula disallowing the found solution until the formula becomes unsatisfiable. This can easily be done by adding the negation of the solution as another clause (read=“ and not this solution”).

As an example we will enumerate all solutions to the Sudoku example on Wikipedia by by Tim Stellmach (license CC-0): Example Sudoku

However as this Sudoku is unique, we leave out the numbers in the center block:

sample_sudoku <- with(grid_helper, index[
  i == 1 & j == 1 & v == 5 | 
  i == 1 & j == 2 & v == 3 | 
  i == 2 & j == 1 & v == 6 | 
  i == 3 & j == 2 & v == 9 | 
  i == 3 & j == 3 & v == 8 | 
                                                      
  i == 1 & j == 5 & v == 7 | 
  i == 2 & j == 4 & v == 1 | 
  i == 2 & j == 5 & v == 9 | 
  i == 2 & j == 6 & v == 5 |
  
  i == 3 & j == 8 & v == 6 | 
                                                      
  i == 4 & j == 1 & v == 8 | 
  i == 5 & j == 1 & v == 4 | 
  i == 6 & j == 1 & v == 7 |
                                                      
  #i == 4 & j == 5 & v == 6 | 
  #i == 5 & j == 4 & v == 8 | 
  #i == 5 & j == 6 & v == 3 | 
  #i == 6 & j == 5 & v == 2 |
  
  i == 4 & j == 9 & v == 3 | 
  i == 5 & j == 9 & v == 1 | 
  i == 6 & j == 9 & v == 6 | 
                                                      
  i == 7 & j == 2 & v == 6 |
  
  i == 8 & j == 4 & v == 4 | 
  i == 8 & j == 5 & v == 1 | 
  i == 8 & j == 6 & v == 9 | 
  i == 9 & j == 5 & v == 8 |
                                                      
  i == 7 & j == 7 & v == 2 | 
  i == 7 & j == 8 & v == 8 | 
  i == 8 & j == 9 & v == 5 | 
  i == 9 & j == 8 & v == 7 | 
  i == 9 & j == 9 & v == 9])
current_formula <- formula
results <- list()
no_results <- 0
system.time(while(TRUE) {
  result <- picosat_sat(current_formula, assumptions = sample_sudoku)
  if (picosat_solution_status(result) == "PICOSAT_SATISFIABLE") {
    no_results <- no_results + 1
    results[[no_results]] <- result
    new_clause <- -1 * (result$variable * ifelse(result$value, 1L, -1L))
    current_formula <- c(list(new_clause), current_formula)
  } else {
    break
  }
})
##    user  system elapsed 
##   1.603   0.033   1.672

The resulting Sudoku has 63 solutions and is therefore not unique.

Feedback

Do you have any questions, ideas, comments? Or did you find a mistake? Send me an email!