Problem: I am building an OCR-based engine to digitize chess scoresheets using Python and python-chess. I use a string-similarity metric (specifically Jaro-Winkler, which is not the same thing as Levenshtein distance) to map recognized text (e.g., "h4") to the list of currently legal moves.
The Issue: Sometimes the OCR misreads a move (e.g., it detects "h4" but the player actually wrote "d4").
The "Greedy Match" Problem: Since "h4" is a perfectly legal move in the current board state, my engine accepts it with a 100% similarity score.
The "Butterfly Effect": Once an incorrect but "legal" move is accepted, the internal board state becomes wrong. Consequently, all subsequent real moves (like "Kg3") are rejected because they are illegal in the corrupted state.
Current Code Snippet:
@staticmethod
def find_best_legal_move(board, ocr_move):
    if not ocr_move or ocr_move.strip() == "":
        return None
    legal_moves = [board.san(m) for m in board.legal_moves]
    if not legal_moves:
        return ocr_move
    best_move = max(legal_moves, key=lambda m: jaro_winkler(m, ocr_move))
    return best_move

@staticmethod
def validate_game(json_data):
    board = chess.Board()
    corrected_moves = []
    for item in json_data["moves"]:
        move_no = item["move_no"]
        white_raw = Engine.translate_to_en(item["white"])
        black_raw = Engine.translate_to_en(item["black"])
        if white_raw == "1/2":
            break
        white_fixed = Engine.find_best_legal_move(board, white_raw)
        if white_fixed:
            board.push_san(white_fixed)
        black_fixed = Engine.find_best_legal_move(board, black_raw)
        if black_fixed:
            board.push_san(black_fixed)
        corrected_moves.append({
            "move_no": move_no,
            "white": white_fixed if white_fixed else white_raw,
            "black": black_fixed if black_fixed else black_raw
        })
    return {"metadata": json_data["metadata"], "moves": corrected_moves}
For "h4" it could be "b4", "d4" or "h3" (less likely). If you run into a sequence of impossible moves, backtrack to the move just before the first illegal one and try one of its alternatives instead. It might also be worth scoring the moves in a half-decent chess engine to weed out all those crazy misread moves that a competent club player would never play.
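The backtracking idea could be sketched roughly like this. It is only an illustration, not a drop-in replacement for the code in the question: the names `similarity`, `ranked_candidates` and `recover_moves`, the `min_score` threshold and the `max_candidates` cap are all made up for this example, and `difflib.SequenceMatcher` from the standard library stands in for `jaro_winkler`. The board argument only needs the `legal_moves`, `san`, `push` and `pop` members that a python-chess `Board` provides.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in similarity in [0, 1]; swap in jaro_winkler if you prefer.
    return SequenceMatcher(None, a, b).ratio()


def ranked_candidates(board, ocr_move, min_score=0.5):
    # Every legal move whose SAN is "close enough" to the OCR text, best first.
    scored = [(similarity(board.san(m), ocr_move), m) for m in board.legal_moves]
    scored = [(s, m) for s, m in scored if s >= min_score]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in scored]


def recover_moves(board, ocr_moves, index=0, min_score=0.5, max_candidates=4):
    # Depth-first search over candidate moves with backtracking.
    # Returns a list of SAN strings covering ocr_moves[index:], or None if
    # no combination of candidates yields a fully legal continuation.
    if index == len(ocr_moves):
        return []
    candidates = ranked_candidates(board, ocr_moves[index], min_score)
    for move in candidates[:max_candidates]:
        san = board.san(move)  # SAN must be computed before the move is pushed
        board.push(move)
        rest = recover_moves(board, ocr_moves, index + 1, min_score, max_candidates)
        board.pop()  # always restore the position before trying the next option
        if rest is not None:
            return [san] + rest
    return None  # dead end: the caller falls back to its next candidate


# Possible usage with python-chess (assumed to be installed):
#   import chess
#   fixed = recover_moves(chess.Board(), ["e4", "e5", "Nf3", "Nc6"])
```

The function leaves the board in its starting position, so the caller would replay the returned list to get the final state. The search is exponential in the worst case, so a real implementation would probably want a node budget, and an engine evaluation could be used to re-rank the candidates before they are tried.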