5
\$\begingroup\$

I wrote this program over half a year ago, but I decided to not share it because the task seems trivial to me. Today I decided to post it here. As it is designed to work with strings and messiness of human language speed is not a concern, but I still optimized the program as best as I can, however I used conditionals very liberally to validate the input at every single step.

Now, what is an English numeral? It is something like this: "six hundred ninety-three vigintillion one hundred forty-seven novemdecillion one hundred eighty octodecillion five hundred fifty-nine septendecillion nine hundred forty-five sexdecillion three hundred nine quindecillion four hundred seventeen quattordecillion two hundred thirty-two tredecillion one hundred twenty-one duodecillion four hundred fifty-eight undecillion one hundred seventy-six decillion five hundred sixty-eight nonillion seventy-five octillion five hundred septillion one hundred thirty-four sextillion three hundred sixty quintillion two hundred fifty-five quadrillion two hundred fifty-four trillion one hundred twenty billion six hundred eighty million nine thousand four hundred ninety-three", which is equal to 693147180559945309417232121458176568075500134360255254120680009493 in Arabic numerals, or 693_147_180_559_945_309_417_232_121_458_176_568_075_500_134_360_255_254_120_680_009_493 with the thousand separator.

As you can see there is a lot of structure in the English numeral system, a lot of patterns, patterns that can be encoded algorithmically. My program supports a strict subset of English numeral, it supports English numeral with numeric value up to "nine hundred ninety-nine vigintillion nine hundred ninety-nine novemdecillion nine hundred ninety-nine octodecillion nine hundred ninety-nine septendecillion nine hundred ninety-nine sexdecillion nine hundred ninety-nine quindecillion nine hundred ninety-nine quattordecillion nine hundred ninety-nine tredecillion nine hundred ninety-nine duodecillion nine hundred ninety-nine undecillion nine hundred ninety-nine decillion nine hundred ninety-nine nonillion nine hundred ninety-nine octillion nine hundred ninety-nine septillion nine hundred ninety-nine sextillion nine hundred ninety-nine quintillion nine hundred ninety-nine quadrillion nine hundred ninety-nine trillion nine hundred ninety-nine billion nine hundred ninety-nine million nine hundred ninety-nine thousand nine hundred ninety-nine" (\$10^{66} - 1\$), and precisely it can convert two unvigintillion minus one (2E+66 - 1) numbers to and from English numeral.

More specifically it supports American short scale, which means it supports -illions but not -illiards, with each increase in -illion multiplying the scale by a thousand. And my program supports scale words up to vigintillion.

Now here are the rules:

  • Rule #1: An English numeral may be preceded by the words "positive" and "negative" exactly once, they set the sign of the numeral, these words are mutually exclusive. Meaning "positive ..." and "negative ..." are allowed, but "positive positive ...", "positive negative ...", "negative positive ...", "negative negative ..." and so on are disallowed.

  • Rule #2: An English numeral may contain the word "zero" exactly once, the word "zero" is mutually exclusive with all other numeric words, so there are only three valid English numerals containing the word "zero": "zero", "positive zero", "negative zero", all of them have the numeric value of 0.

  • Rule #3: An English numeral may contain the unit words "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", corresponding to numeric values 1 to 9, they may appear alone or in the rightmost place, in which case they modify the total numeric value by adding their corresponding numeric values, or they may appear to the right of words ending in -ty and a dash (-), in which case they modify the place value by adding their own numeric value to the place value, or they may appear to the left of the word "hundred" or words signifying powers of one thousand, in which case they modify the place numeric value by multiplying the place numeric value with their own numeric value.

  • Rule #4: An English numeral may contain the words: "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen", "seventeen", "eighteen", "nineteen", corresponding to numeric values range(10, 20), they may appear alone or in the rightmost place, in which case they modify the total numeric value by adding their corresponding numeric values, or they may appear to the right of the word "hundred" or words signifying powers of one thousand, in which case they modify the place numeric value by adding the place numeric value with their own numeric value.

  • Rule #5: An English numeral may contain the words: "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety", corresponding to numeric values range(20, 91, 10), they may appear alone or in the rightmost place, in which case they modify the total numeric value by adding their corresponding numeric values, or they may appear to the right of the word "hundred" or words signifying powers of one thousand or to the left of a dash before a unit word, in which case they modify the place numeric value by adding the place numeric value with their own numeric value.

  • Rule #6: An English numeral may contain the word "hundred" corresponding to 100, it may appear after the word "a" in which case no other words are allowed after it, or it may appear after the unit words in which case they modify the place numeric value by adding the unit numeric value multiplied with 100 to the place value.

  • Rule #7: An English numeral may contain any number of chunks each corresponding to a numeric value less than 1000, a chunk may be in this format "{unit_word}|{teen_word}|{ty_word}" in which case it has the numeric value of the word consisting the chunk, or a chunk may be in this format: "{unit_word} hundred (and )? ({teen_word}|{ty_word})", in which case the place value is the first times 100 added with the second, or a chunk may be in this format: "{unit_word} hundred (and )? {ty_word}-{unit_word}", in which case the place value is first * 100 + second + third.

  • Rule #8: An English numeral may contain the words: "thousand", "million", "billion", "trillion", "quadrillion", "quintillion", "sextillion", "septillion", "octillion", "nonillion", "decillion", "undecillion", "duodecillion", "tredecillion", "quattordecillion", "quindecillion", "sexdecillion", "septendecillion", "octodecillion", "novemdecillion", "vigintillion", corresponding to numeric values \$\left(10^{3i}|1 \le i \le 21\right)\$, they each must be preceded by a chunk, and they modify the total numeric value by adding the place value of the chunk preceding them multiplied with their own numeric value. A chunk may consist of the word "a" which has a numeric value of 1 and it must be followed by exactly one scale word and nothing else, it must not be preceded by a sign word. All scale words must be separated by chunks, all scale words must be appearing in strictly descending order, meaning no scale word can appear multiple times in an English numeral and for any scale word all scale words appearing after it must be strictly smaller than it.

And these are all we need to implement an English numeral converter.

As for Roman numerals, it is much simpler:

  • Rule #1: Roman numerals have digits {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

  • Rule #2: Roman numeral digits may appear alone or appear up to three times in a row, this is a chunk with numeric value equal to the digit value times the occurrences.

  • Rule #3: "IV" is 4, "IX" is 9, "XL" is 40, "XC" is 90, "CD" is 400 and "CM" is 900.

  • Rule #4: A Roman numeral can have up to four chunks in strictly descending order, no two chunks with the same order of magnitude can appear in a Roman numeral, and the numeric of a Roman numeral is the sum of the numeric values of the chunks.

And I have just explained all the code in my program.

Here is the code:

# fmt: off
LT_100 = {
    "one":           b"001", "two":           b"002", "three":         b"003", "four":          b"004", "five":          b"005",
    "six":           b"006", "seven":         b"007", "eight":         b"008", "nine":          b"009", "ten":           b"010",
    "eleven":        b"011", "twelve":        b"012", "thirteen":      b"013", "fourteen":      b"014", "fifteen":       b"015",
    "sixteen":       b"016", "seventeen":     b"017", "eighteen":      b"018", "nineteen":      b"019", "twenty":        b"020",
    "twenty-one":    b"021", "twenty-two":    b"022", "twenty-three":  b"023", "twenty-four":   b"024", "twenty-five":   b"025",
    "twenty-six":    b"026", "twenty-seven":  b"027", "twenty-eight":  b"028", "twenty-nine":   b"029", "thirty":        b"030",
    "thirty-one":    b"031", "thirty-two":    b"032", "thirty-three":  b"033", "thirty-four":   b"034", "thirty-five":   b"035",
    "thirty-six":    b"036", "thirty-seven":  b"037", "thirty-eight":  b"038", "thirty-nine":   b"039", "forty":         b"040",
    "forty-one":     b"041", "forty-two":     b"042", "forty-three":   b"043", "forty-four":    b"044", "forty-five":    b"045",
    "forty-six":     b"046", "forty-seven":   b"047", "forty-eight":   b"048", "forty-nine":    b"049", "fifty":         b"050",
    "fifty-one":     b"051", "fifty-two":     b"052", "fifty-three":   b"053", "fifty-four":    b"054", "fifty-five":    b"055",
    "fifty-six":     b"056", "fifty-seven":   b"057", "fifty-eight":   b"058", "fifty-nine":    b"059", "sixty":         b"060",
    "sixty-one":     b"061", "sixty-two":     b"062", "sixty-three":   b"063", "sixty-four":    b"064", "sixty-five":    b"065",
    "sixty-six":     b"066", "sixty-seven":   b"067", "sixty-eight":   b"068", "sixty-nine":    b"069", "seventy":       b"070",
    "seventy-one":   b"071", "seventy-two":   b"072", "seventy-three": b"073", "seventy-four":  b"074", "seventy-five":  b"075",
    "seventy-six":   b"076", "seventy-seven": b"077", "seventy-eight": b"078", "seventy-nine":  b"079", "eighty":        b"080",
    "eighty-one":    b"081", "eighty-two":    b"082", "eighty-three":  b"083", "eighty-four":   b"084", "eighty-five":   b"085",
    "eighty-six":    b"086", "eighty-seven":  b"087", "eighty-eight":  b"088", "eighty-nine":   b"089", "ninety":        b"090",
    "ninety-one":    b"091", "ninety-two":    b"092", "ninety-three":  b"093", "ninety-four":   b"094", "ninety-five":   b"095",
    "ninety-six":    b"096", "ninety-seven":  b"097", "ninety-eight":  b"098", "ninety-nine":   b"099"
}

SCALES = {
    "thousand":         3,
    "million":          6,
    "billion":          9,
    "trillion":         12,
    "quadrillion":      15,
    "quintillion":      18,
    "sextillion":       21,
    "septillion":       24,
    "octillion":        27,
    "nonillion":        30,
    "decillion":        33,
    "undecillion":      36,
    "duodecillion":     39,
    "tredecillion":     42,
    "quattordecillion": 45,
    "quindecillion":    48,
    "sexdecillion":     51,
    "septendecillion":  54,
    "octodecillion":    57,
    "novemdecillion":   60,
    "vigintillion":     63,
}

PLACES = {"hundred": 100} | { scale: 10**power for scale, power in SCALES.items() }

REV_LT_100 = [
    ""             , "one"          , "two"          , "three"        , "four"         ,
    "five"         , "six"          , "seven"        , "eight"        , "nine"         ,
    "ten"          , "eleven"       , "twelve"       , "thirteen"     , "fourteen"     ,
    "fifteen"      , "sixteen"      , "seventeen"    , "eighteen"     , "nineteen"     ,
    "twenty"       , "twenty-one"   , "twenty-two"   , "twenty-three" , "twenty-four"  ,
    "twenty-five"  , "twenty-six"   , "twenty-seven" , "twenty-eight" , "twenty-nine"  ,
    "thirty"       , "thirty-one"   , "thirty-two"   , "thirty-three" , "thirty-four"  ,
    "thirty-five"  , "thirty-six"   , "thirty-seven" , "thirty-eight" , "thirty-nine"  ,
    "forty"        , "forty-one"    , "forty-two"    , "forty-three"  , "forty-four"   ,
    "forty-five"   , "forty-six"    , "forty-seven"  , "forty-eight"  , "forty-nine"   ,
    "fifty"        , "fifty-one"    , "fifty-two"    , "fifty-three"  , "fifty-four"   ,
    "fifty-five"   , "fifty-six"    , "fifty-seven"  , "fifty-eight"  , "fifty-nine"   ,
    "sixty"        , "sixty-one"    , "sixty-two"    , "sixty-three"  , "sixty-four"   ,
    "sixty-five"   , "sixty-six"    , "sixty-seven"  , "sixty-eight"  , "sixty-nine"   ,
    "seventy"      , "seventy-one"  , "seventy-two"  , "seventy-three", "seventy-four" ,
    "seventy-five" , "seventy-six"  , "seventy-seven", "seventy-eight", "seventy-nine" ,
    "eighty"       , "eighty-one"   , "eighty-two"   , "eighty-three" , "eighty-four"  ,
    "eighty-five"  , "eighty-six"   , "eighty-seven" , "eighty-eight" , "eighty-nine"  ,
    "ninety"       , "ninety-one"   , "ninety-two"   , "ninety-three" , "ninety-four"  ,
    "ninety-five"  , "ninety-six"   , "ninety-seven" , "ninety-eight" , "ninety-nine"
]

HUNDREDS = [
    ""             , "one hundred"  , "two hundred"  , "three hundred", "four hundred" ,
    "five hundred" , "six hundred"  , "seven hundred", "eight hundred", "nine hundred" ,
]

SCALES_LIST = [
    "thousand"        , "million"         , "billion"         ,
    "trillion"        , "quadrillion"     , "quintillion"     ,
    "sextillion"      , "septillion"      , "octillion"       ,
    "nonillion"       , "decillion"       , "undecillion"     ,
    "duodecillion"    , "tredecillion"    , "quattordecillion",
    "quindecillion"   , "sexdecillion"    , "septendecillion" ,
    "octodecillion"   , "novemdecillion"  , "vigintillion"
]

ROMAN_UNITS = ["", "I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX"]

ROMAN_TENS = ["", "X", "XX", "XXX", "XL", "L", "LX", "LXX", "LXXX", "XC"]

ROMAN_HUNDREDS = ["", "C", "CC", "CCC", "CD", "D", "DC", "DCC", "DCCC", "CM"]

ROMAN_THOUSANDS = ["", "M", "MM", "MMM"]

ROMAN_NUMERALS = [
    ("VIII", 4, 1, 8   ), ("LXXX", 4, 2, 80  ), ("DCCC", 4, 3, 800 ),
    ("III" , 3, 1, 3   ), ("VII" , 3, 1, 7   ), ("XXX" , 3, 2, 30  ),
    ("LXX" , 3, 2, 70  ), ("CCC" , 3, 3, 300 ), ("DCC" , 3, 3, 700 ),
    ("MMM" , 3, 4, 3000), ("II"  , 2, 1, 2   ), ("IV"  , 2, 1, 4   ),
    ("VI"  , 2, 1, 6   ), ("IX"  , 2, 1, 9   ), ("XX"  , 2, 2, 20  ),
    ("XL"  , 2, 2, 40  ), ("LX"  , 2, 2, 60  ), ("XC"  , 2, 2, 90  ),
    ("CC"  , 2, 3, 200 ), ("CD"  , 2, 3, 400 ), ("DC"  , 2, 3, 600 ),
    ("CM"  , 2, 3, 900 ), ("MM"  , 2, 4, 2000), ("I"   , 1, 1, 1   ),
    ("V"   , 1, 1, 5   ), ("X"   , 1, 2, 10  ), ("L"   , 1, 2, 50  ),
    ("C"   , 1, 3, 100 ), ("D"   , 1, 3, 500 ), ("M"   , 1, 4, 1000)
]
# fmt: on


def parse_Roman_numeral(numeral: str) -> int:
    if not isinstance(numeral, str):
        raise TypeError("argument numeral must be str")

    if numeral in {"", "N"}:
        return 0

    prev_len = 5
    number = 0
    pos = 0
    end = len(numeral)
    while pos < end:
        for chunk, length, places, value in ROMAN_NUMERALS:
            next_pos = pos + length
            if numeral[pos:next_pos] == chunk:
                pos = next_pos
                break
        else:
            raise ValueError("detected invalid digit in Roman numeral")

        if (cur_len := places) >= prev_len:
            raise ValueError("Roman numerals were written in strictly descending order")

        number += value
        prev_len = cur_len

    return number


def to_Roman_numeral(number: int) -> str:
    if not isinstance(number, int):
        raise TypeError("argument number must be int")

    if not number:
        return "N"

    if number < 0 or number > 3999:
        raise ValueError("Roman numerals can only represent numbers 0 to 3999")

    numeral = ""
    q, r = divmod(number, 1000)
    numeral += ROMAN_THOUSANDS[q]
    q, r = divmod(r, 100)
    numeral += ROMAN_HUNDREDS[q]
    q, r = divmod(r, 10)
    numeral += ROMAN_TENS[q]
    return numeral + ROMAN_UNITS[r]


def handle_singleton(length: int, words: list[str], values: dict) -> None:
    if length != 2:
        raise ValueError(
            "'a' must be followed by exactly one scale word and nothing else"
        )

    second = words[1]
    if not (power := PLACES.get(second)):
        raise ValueError("the word after 'a' isn't a scale word")

    values |= {"found": 1, "number": power}


def preprocess_English_numeral_worker(
    first: str, length: int, words: list[str], values: dict
) -> None:
    negative = first == "negative"
    if negative or first == "positive":
        if length == 1:
            raise ValueError("sign must modify a number")

        words.pop(0)
        length = len(words)

    if words[0] == "zero":
        if length != 1:
            raise ValueError(
                "only the following valid English numerals can contain zero: 'negative zero', 'zero', 'positive zero'"
            )

        values |= {"found": 1, "number": 0}
    else:
        values |= {"sign": "-" * negative, "words": words}


def preprocess_English_numeral(numeral: str, values: dict) -> None:
    numeral = numeral.strip().lower()
    if not numeral:
        raise ValueError("empty string isn't valid English numeral")

    words = numeral.split(" ")
    first = words[0]
    length = len(words)
    if first == "a":
        handle_singleton(length, words, values)
    else:
        preprocess_English_numeral_worker(first, length, words, values)


def handle_special(
    word: str, multiplier: bytes, allow_and: bool
) -> tuple[int, int, int, int]:
    if word == "hundred":
        if multiplier[:2] != b"00" or multiplier[2] == 48:
            raise ValueError(
                "'hundred' must be preceded by a number between one and nine"
            )

        return 1, 1, 0, multiplier[::-1]

    if allow_and and word == "and":
        return 1, 0, 1, multiplier

    return 0, 0, 0, multiplier


def handle_multiplier(
    partial: int, multiplier: bytes, multiplier_count: int
) -> tuple[int, int]:
    if multiplier == b"000":
        return partial, 1
    elif (first := multiplier[:1]) != b"0" and multiplier_count == 1:
        return first + partial[1:], 2
    else:
        raise ValueError("two separate multipliers cannot be adjacent")


def parse_English_numeral_worker(words: list[str]) -> list[tuple[bytes, int]]:
    multiplier_count = scale_power = allow_and = 0
    multiplier = b"000"
    last = 64
    chunks = []
    for word in words:
        skip, allow_and, used_and, multiplier = handle_special(
            word, multiplier, allow_and
        )
        if skip:
            continue

        if partial := LT_100.get(word):
            multiplier, multiplier_count = handle_multiplier(
                partial, multiplier, multiplier_count
            )
            scale_power = 0
        else:
            if (
                used_and
                or multiplier == b"000"
                or not (scale_power := SCALES.get(word))
                or scale_power >= last
            ):
                raise ValueError(
                    f"detected illegal word in standard short scale English numeral of integers for current position: {word}"
                )

            chunks.append((multiplier, scale_power))
            multiplier_count = 0
            multiplier = b"000"
            last = scale_power

        allow_and = used_and = False

    if not scale_power:
        chunks.append((multiplier, 0))

    return chunks


def parse_English_numeral(numeral: str) -> int:
    if not isinstance(numeral, str):
        raise TypeError("argument numeral must be str")

    values = {"found": False}
    preprocess_English_numeral(numeral, values)
    if values["found"]:
        return values["number"]

    chunks = parse_English_numeral_worker(values["words"])
    length = chunks[0][1] + 3
    digits = bytearray(
        b"0" * length,
    )
    for chunk, shift in chunks:
        end = length - shift
        digits[end - 3 : end] = chunk

    digits = digits.decode()
    digits.lstrip()
    return int(values["sign"] + digits)


def to_English_numeral(number: int) -> str:
    if not isinstance(number, int):
        raise TypeError("argument number must be int")

    if not number:
        return "zero"

    sign = ""
    if number < 0:
        number = -number
        sign = "negative "

    decimal = str(number)
    length = len(decimal)
    if length >= 67:
        raise ValueError("this function only supports scale words up to vigintillion")

    mod = length % 3
    padding = (3 - mod) * bool(mod)
    length += padding
    decimal = "0" * padding + decimal
    chunks = []
    hundreds, remainder = divmod(int(decimal[-3:]), 100)
    if remainder:
        chunks.append(REV_LT_100[remainder])
    if hundreds:
        chunks.append(HUNDREDS[hundreds])
    for index, (start, end) in enumerate(
        zip(range(length - 6, -1, -3), range(length - 3, 0, -3))
    ):
        if chunk := int(decimal[start:end]):
            hundreds, remainder = divmod(chunk, 100)
            chunks.append(SCALES_LIST[index])
            if remainder:
                chunks.append(REV_LT_100[remainder])
            if hundreds:
                chunks.append(HUNDREDS[hundreds])

    chunks.reverse()
    return sign + " ".join(chunks)

Test:

In [2]: to_English_numeral(299792458)
Out[2]: 'two hundred ninety-nine million seven hundred ninety-two thousand four hundred fifty-eight'

In [3]: parse_English_numeral('two hundred ninety-nine million seven hundred ninety-two thousand four hundred fifty-eight')
Out[3]: 299792458

In [4]: to_English_numeral(86400)
Out[4]: 'eighty-six thousand four hundred'

In [5]: parse_English_numeral('eighty-six thousand four hundred')
Out[5]: 86400

In [6]: to_English_numeral(31556952)
Out[6]: 'thirty-one million five hundred fifty-six thousand nine hundred fifty-two'

In [7]: parse_English_numeral('thirty-one million five hundred fifty-six thousand nine hundred fifty-two')
Out[7]: 31556952

In [8]: to_English_numeral(9460536207068016)
Out[8]: 'nine quadrillion four hundred sixty trillion five hundred thirty-six billion two hundred seven million sixty-eight thousand sixteen'

In [9]: parse_English_numeral('nine quadrillion four hundred sixty trillion five hundred thirty-six billion two hundred seven million sixty-eight thousand sixteen')
Out[9]: 9460536207068016

In [10]: to_English_numeral(86315)
Out[10]: 'eighty-six thousand three hundred fifteen'

In [11]: parse_English_numeral('eighty-six thousand three hundred fifteen')
Out[11]: 86315

In [12]: to_English_numeral(-314159265358979323846264338327950288419716939937510582097494459230)
Out[12]: 'negative three hundred fourteen vigintillion one hundred fifty-nine novemdecillion two hundred sixty-five octodecillion three hundred fifty-eight septendecillion nine hundred seventy-nine sexdecillion three hundred twenty-three quindecillion eight hundred forty-six quattordecillion two hundred sixty-four tredecillion three hundred thirty-eight duodecillion three hundred twenty-seven undecillion nine hundred fifty decillion two hundred eighty-eight nonillion four hundred nineteen octillion seven hundred sixteen septillion nine hundred thirty-nine sextillion nine hundred thirty-seven quintillion five hundred ten quadrillion five hundred eighty-two trillion ninety-seven billion four hundred ninety-four million four hundred fifty-nine thousand two hundred thirty'

In [13]: parse_English_numeral('negative three hundred fourteen vigintillion one hundred fifty-nine novemdecillion two hundred sixty-five octodecillion three hundred fifty-eight septendecillion nin
    ...: e hundred seventy-nine sexdecillion three hundred twenty-three quindecillion eight hundred forty-six quattordecillion two hundred sixty-four tredecillion three hundred thirty-eight duodeci
    ...: llion three hundred twenty-seven undecillion nine hundred fifty decillion two hundred eighty-eight nonillion four hundred nineteen octillion seven hundred sixteen septillion nine hundred t
    ...: hirty-nine sextillion nine hundred thirty-seven quintillion five hundred ten quadrillion five hundred eighty-two trillion ninety-seven billion four hundred ninety-four million four hundred
    ...:  fifty-nine thousand two hundred thirty')
Out[13]: -314159265358979323846264338327950288419716939937510582097494459230

In [14]: to_English_numeral(100200300400500600700800900)
Out[14]: 'one hundred septillion two hundred sextillion three hundred quintillion four hundred quadrillion five hundred trillion six hundred billion seven hundred million eight hundred thousand nine hundred'

In [15]: parse_English_numeral('one hundred septillion two hundred sextillion three hundred quintillion four hundred quadrillion five hundred trillion six hundred billion seven hundred million eigh
    ...: t hundred thousand nine hundred')
Out[15]: 100200300400500600700800900

In [16]: to_English_numeral(10**42)
Out[16]: 'one tredecillion'

In [17]: parse_English_numeral('one tredecillion')
Out[17]: 1000000000000000000000000000000000000000000

In [18]: parse_English_numeral("vigintillion")
---------------------------------------------------------------------------
...
ValueError: detected illegal word in standard short scale English numeral of integers for current position: vigintillion

In [19]: parse_English_numeral("a vigintillion")
Out[19]: 1000000000000000000000000000000000000000000000000000000000000000

In [20]: parse_English_numeral("one vigintillion")
Out[20]: 1000000000000000000000000000000000000000000000000000000000000000

In [21]: parse_English_numeral("one vigintillion million")
---------------------------------------------------------------------------
...
ValueError: detected illegal word in standard short scale English numeral of integers for current position: million

In [22]: parse_English_numeral("one vigintillion a million")
---------------------------------------------------------------------------
...
ValueError: detected illegal word in standard short scale English numeral of integers for current position: a

In [23]: parse_English_numeral("one vigintillion one million")
Out[23]: 1000000000000000000000000000000000000000000000000000000001000000

In [24]: parse_English_numeral("hundred")
---------------------------------------------------------------------------
...
ValueError: 'hundred' must be preceded by a number between one and nine

In [25]: parse_English_numeral("a hundred")
Out[25]: 100

In [26]: parse_English_numeral("hundred thousand")
---------------------------------------------------------------------------
...
ValueError: 'hundred' must be preceded by a number between one and nine

In [27]: parse_English_numeral("nine hundred thousand")
Out[27]: 900000

In [28]: parse_English_numeral("nine hundred hundred")
---------------------------------------------------------------------------
...
ValueError: 'hundred' must be preceded by a number between one and nine

In [29]: parse_English_numeral("nine hundred thousand thousand")
---------------------------------------------------------------------------
...
ValueError: detected illegal word in standard short scale English numeral of integers for current position: thousand

In [30]: parse_English_numeral("one billion billion billion billion billion billion billion billion billion billion")
---------------------------------------------------------------------------
...
ValueError: detected illegal word in standard short scale English numeral of integers for current position: billion

In [31]: parse_English_numeral("one novemdecillion")
Out[31]: 1000000000000000000000000000000000000000000000000000000000000

In [32]: parse_English_numeral("nineteen hundred")
---------------------------------------------------------------------------
...
ValueError: 'hundred' must be preceded by a number between one and nine

In [33]: parse_English_numeral("one thousand nine hundred")
Out[33]: 1900

In [34]: parse_English_numeral("one thousand and hundred")
---------------------------------------------------------------------------
...
ValueError: detected illegal word in standard short scale English numeral of integers for current position: and

In [35]: parse_English_numeral("one thousand and one")
---------------------------------------------------------------------------
...
ValueError: detected illegal word in standard short scale English numeral of integers for current position: and

In [36]: parse_English_numeral("one thousand zero hundred and one")
---------------------------------------------------------------------------
...
ValueError: detected illegal word in standard short scale English numeral of integers for current position: zero

In [37]: parse_English_numeral("one thousand one hundred and one")
Out[37]: 1101

In [38]: for i in range(4000):
    ...:     assert parse_Roman_numeral(to_Roman_numeral(i)) == i

In [39]: max((to_Roman_numeral(i) for i in range(4000)), key=len)
Out[39]: 'MMMDCCCLXXXVIII'

In [40]: parse_Roman_numeral('MMMDCCCLXXXVIII')
Out[40]: 3888

In [41]: to_English_numeral(693147180559945309417232121458176568075500134360255254120680009493)
Out[41]: 'six hundred ninety-three vigintillion one hundred forty-seven novemdecillion one hundred eighty octodecillion five hundred fifty-nine septendecillion nine hundred forty-five sexdecillion three hundred nine quindecillion four hundred seventeen quattordecillion two hundred thirty-two tredecillion one hundred twenty-one duodecillion four hundred fifty-eight undecillion one hundred seventy-six decillion five hundred sixty-eight nonillion seventy-five octillion five hundred septillion one hundred thirty-four sextillion three hundred sixty quintillion two hundred fifty-five quadrillion two hundred fifty-four trillion one hundred twenty billion six hundred eighty million nine thousand four hundred ninety-three'

In [42]: to_English_numeral(10**66-1)
Out[42]: 'nine hundred ninety-nine vigintillion nine hundred ninety-nine novemdecillion nine hundred ninety-nine octodecillion nine hundred ninety-nine septendecillion nine hundred ninety-nine sexdecillion nine hundred ninety-nine quindecillion nine hundred ninety-nine quattordecillion nine hundred ninety-nine tredecillion nine hundred ninety-nine duodecillion nine hundred ninety-nine undecillion nine hundred ninety-nine decillion nine hundred ninety-nine nonillion nine hundred ninety-nine octillion nine hundred ninety-nine septillion nine hundred ninety-nine sextillion nine hundred ninety-nine quintillion nine hundred ninety-nine quadrillion nine hundred ninety-nine trillion nine hundred ninety-nine billion nine hundred ninety-nine million nine hundred ninety-nine thousand nine hundred ninety-nine'

In [43]: parse_English_numeral("a hundred one")
---------------------------------------------------------------------------
...
ValueError: 'a' must be followed by exactly one scale word and nothing else

In [44]: parse_English_numeral("twenty twenty")
---------------------------------------------------------------------------
...
ValueError: two separate multipliers cannot be adjacent

How can my program be improved?


Update

Per the rules of classical Roman numerals, specifically the rule about no consecutive occurrences of the same digit with length more than three, it is easy to see that the maximum number that can be expressed as classical Roman numeral is 3999, which is "MMMCMXCIX" in Roman numeral. Because for numbers 4000 and up, Romans didn't have digits for them, so they would need "MMMM..." which aren't allowed. Romans also didn't have the concept of zero but it is just silly, modern extensions of Roman numerals usually use "N" to denote 0. Also Romans denied negative numbers, that is silly too, but in today's usage Roman numerals are typically used to represent positive numbers only.


Update 1

To respond to a certain comment, although it is likely a joke, here is a serious response.

In Python or any programming language I know there is no signed 0 integer, to be more precise computers use two's complement which is inverting the bits and add one. For example, in int64_t "seven quintillion five hundred forty quadrillion one hundred thirteen trillion eight hundred four billion seven hundred forty-six million three hundred forty-six thousand four hundred twenty-nine" is 0b0110100010100011110111011000111001100001111011001100111110111101. To get the negative we invert the bits and add one. To invert the bits, XOR with all ones, 0b0110100010100011110111011000111001100001111011001100111110111101 ^ (1 << 64) - 1 = 0b1001011101011100001000100111000110011110000100110011000001000010. Then we add one to get 0b1001011101011100001000100111000110011110000100110011000001000011.

It may not seem much, but this forces the nonexistence of signed zero. Why? To start, when the left most bit is set, does that mean it is a positive number greater than 9223372036854775808 or a negative number smaller than -9223372036854775808? To resolve the ambiguity and to ensure unique representation for all integers, we must sacrifice the leftmost bit to denote the sign, and so we can only use 63 bits for the numeric value. Thus if the leftmost bit is set, the number is negative, if it isn't then it is positive.

Now, this also means the smallest number we can represent using 64 bits is -9223372036854775808. Because it is one followed by 63 zeroes, after inverting we have 63 ones and add one we propagate the carry all the way to the left and we get one followed by 63 zeroes. Consequently if any bit other than the first bit is set, it becomes 0 after inverting, thus it stops the carry propagation after adding one. For example, 0b1111111111111111111111111111111111111111111111111111111111111111 is -1 in int64_t.

Now what is the negative of 0? 0 is all zeros, after inverting we have all ones, after adding one we need to propagate the carry all the way through to the left, we would need one more bit to store the number than what we started, for example in int64_t we need 1 followed by 64 0s, but that requires 65 bits and we only have 64, so we ignore the leading bit thus -0 = +0. Alternatively after inverting we get all ones, which is negative one, negative one plus one is zero.

But what about signed zero? That is because of floating point. I don't want to go into the details, but floating points cannot natively represent 0. Because two to any power isn't zero, \$\log_2(0)\$ is undefined (or in any base for that matter). Thus it is a special case, when the exponent and mantissa are all zeroes, the number is treated as 0, and the first bit is the sign bit, giving positive zero and negative zero.

But if the mantissa is all zeroes would that mean zero too? It could, but it breaks uniqueness of representations, the same number can be represented in different ways by adding/subtracting the exponent and correspondingly right-shifting/left-shifting the mantissa, so 52 bits of the mantissa is explicitly stored with the first bit of the mantissa is implicitly set as 1 except in special cases.

In [55]: -0
Out[55]: 0

In [56]: +0
Out[56]: 0

In [57]: -0.0
Out[57]: -0.0

In [58]: +0.0
Out[58]: 0.0

Update 2

I thought my variable name was self-evident, or the contents of the container should make the variable name self-explanatory. Apparently I was wrong.

LT_100 contains all the positive integers below one hundred in English numeral, so it shouldn't be hard to guess that LT_100 means Less Than 100. And similarly REV_LT_100 is the REVerse of LT_100.

As for why I chose that name, well, don't blame me, I trust most Python programmers know what __lt__, __eq__, __gt__, __le__, __ne__, __ge__ mean.

\$\endgroup\$
5
  • \$\begingroup\$ Does Python not distinguish between +0 and -0? \$\endgroup\$ Commented 2 days ago
  • 1
    \$\begingroup\$ @TobySpeight it only does in the floating point case, i.e. -0. \$\endgroup\$ Commented 2 days ago
  • 1
    \$\begingroup\$ Thanks @Reinderien - that's informative. The edit to the question implies that Python requires a twos-complement representation for integers (which precludes differently-signed zero, unlike e.g. sign+magnitude representations) - that's another thing I didn't know. \$\endgroup\$ Commented 2 days ago
  • 2
    \$\begingroup\$ (1) The principal application of such a program would be with regard to calendar years. As such you will have to employ BC or AD after the number concerned. This gets rid of your + and - issue. (2) Actually Roman numerals do allow numbers higher than 3999. c.f. Vinculum and note that there are unicode characters for these characters, e.g. in Java "MV" + "\u0305", // used for 4XXX. I did a converter in Java some years back as an exercise in test-driven development. Most effort went into standardizing and validating user input. \$\endgroup\$ Commented yesterday
  • \$\begingroup\$ Although it seems unlikely to me that someone would want to convert the Roman number IIII, Why Do Clocks and Watches Use the Roman Numeral IIII instead of IV? lists some possible reasons for the latter being so. \$\endgroup\$ Commented 14 hours ago

3 Answers 3

8
\$\begingroup\$

Bug: digits.lstrip() result is discarded

In parse_English_numeral:

digits = digits.decode()
digits.lstrip()
return int(values["sign"] + digits)

In the above str.lstrip() doesn't strip in-place; the return value is silently thrown away. This doesn't actually cause a visible bug because int() tolerates leading zeros, but it's dead code, and if you ever change the return type it could bite you. The fix is either digits = digits.lstrip("0") or "0" or just rely on int() to do the right thing and delete the call entirely.

Data duplication between parallel structures

You maintain LT_100 (apparently a dict for parsing) and REV_LT_100 (a list for generation) as entirely separate and manually written tables with 99 entries each. Same goes for SCALES / SCALES_LIST. These are easy to get out of sync in future maintenance. Consider deriving one from the other:

# you can build REV_LT_100 from LT_100 once at module load
REV_LT_100 = [""] * 100
for word, digits in LT_100.items():
    REV_LT_100[int(digits)] = word

# then SCALES_LIST is just the keys of SCALES in insertion order
SCALES_LIST = list(SCALES)

Note: python dicts have preserved insertion order since 3.7, so this should be safe. It also means your ground truth lives in exactly one place.

The bytes manipulation trick deserves a comment

This is how the multiplier works: b"005" represents 005 as ASCII digits, and multiplier[::-1] converts a units-place value like b"005" into a hundreds-place value b"500". Kinda clever, but also quite opaque to a reader. A short comment at the top of parse_English_numeral_worker explaining the encoding would make the logic much easier to follow without needing to reverse-engineer it (as I personally did).

Magic sentinel last = 64

multiplier_count = scale_power = allow_and = 0
...
last = 64

The number 64 works because the largest scale power is 63 (vigintillion), but this is a magic number. A reader has to already know the table to understand why 64 is safe. Something like:

last = max(SCALES.values()) + 1  # sentinel: larger than any valid scale

or perhaps simply float('inf') would make the intent self-documenting.


Also, the values dict as an out-parameter is worth reconsidering. preprocess_English_numeral (and friends) mutate a shared values dict rather than returning results. This pattern is borrowed from languages that lack multiple return values, but python could handle it cleanly with either a namedtuple or a dataclass:

from typing import NamedTuple


class PreprocessResult(NamedTuple):
    found: bool
    number: int = 0
    sign: str = ""
    words: list[str] = []

This would make the contract of each function explicit in its signature rather than in its documentation (or the reader's imagination :D).

\$\endgroup\$
5
\$\begingroup\$

Error checking

It is great that you provide error checking of user input. Messages like this are very easy to understand:

raise ValueError("Roman numerals can only represent numbers 0 to 3999")

I think this message could use a little elaboration:

raise ValueError("this function only supports scale words up to vigintillion")

Most people have never even see the word "vigintillion" before, let alone know what it means:

raise ValueError("this function only supports scale words up to vigintillion (63 digits)")

Even my spell-checker doesn't recognize it :)

DRY

SCALES and SCALES_LIST are very similar. I think you can programmatically derive SCALES_LIST from SCALES. The same goes for REV_LT_100 and LT_100.

Naming

The constant name LT_100 is a bit cryptic. I don't know what "LT" means, and I think it would be better to spell out.

Documentation

It would be nice to add some docstring to functions to distinguish those with very similar names like:

parse_English_numeral_worker
parse_English_numeral
\$\endgroup\$
4
\$\begingroup\$

Documentation

Docstrings for your function and the module would be a big improvement.

Organization

There are a lot of constants here. You may wish to put them into a separate file that you import.

Quibbles

You can check for membership in a range here.

    if number < 0 or number > 3999:

Revised:

    if number not in range(4000):

Further, what does 3999 signify? Watch out for magic numbers.

In the following, I'd probably put both of the exception-raising guards first.

def to_Roman_numeral(number: int) -> str:
    if not isinstance(number, int):
        raise TypeError("argument number must be int")

    if not number:
        return "N"

    if number < 0 or number > 3999:
        raise ValueError("Roman numerals can only represent numbers 0 to 3999")

Vs.

def to_Roman_numeral(number: int) -> str:
    if not isinstance(number, int):
        raise TypeError("argument number must be int")

    if number < 0 or number > 3999:
        raise ValueError("Roman numerals can only represent numbers 0 to 3999")

    if not number:
        return "N"

The second variable is only used once. I would skip it and just use words[1] directly.

def handle_singleton(length: int, words: list[str], values: dict) -> None:
    if length != 2:
        raise ValueError(
            "'a' must be followed by exactly one scale word and nothing else"
        )

    second = words[1]
    if not (power := PLACES.get(second)):
        raise ValueError("the word after 'a' isn't a scale word")

    values |= {"found": 1, "number": power}

You're doing some manual padding that may be better off using an f-string.

    decimal = "0" * padding + decimal

Vs.

    decimal = f"{decimal:0{length}}"

Public functions

Docstrings would help clarify: Are any "preprocess" functions meant to be called by a user of this module, or should they start with an underscore and be restricted to use by other functions in the module?

\$\endgroup\$

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.