Teebone Ding Technical Blog

Django, Python, Javascript, Pig, and Hadoop.

Generate a YouTube-Video-ID-like Hash Function in Python

A YouTube video URL looks like https://www.youtube.com/watch?v=Qx-ezesM3nA. The video ID consists of 11 characters drawn from ASCII uppercase and lowercase letters, digits, and a few other characters such as "-" and "_". Sometimes we want a hash function that outputs a YouTube-video-ID-like value, for example in a URL shortener that generates a short URL from the original long URL. Here is my implementation, which produces 10 characters from letters and digits only:

util.py
import random
import string


def genHash(seed):
    """Return a 10-character hash derived deterministically from seed."""
    # Output hash base: all ASCII letters and digits (62 characters)
    base = string.ascii_letters + string.digits
    # Seed a private generator so the module-level random state is untouched
    rng = random.Random()
    rng.seed(seed)  # Input string as the random seed
    # Build the hash by picking 10 characters from base with the seeded generator
    return "".join(rng.choice(base) for _ in range(10))

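The seed determines the random number sequence the generator produces, so the same seed always yields the same hash value. The mapping from seed (input) to hash value (output) is deterministic, but it is not strictly 1-to-1: there are only 62^10 possible outputs and arbitrarily many inputs, so two different seeds can, in principle, produce the same hash.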

Example output:

>>> genHash("DoReMi")
'pd7yPqNmDB'
>>> genHash("DoReMiFa")
'ZiM4CNISMF'
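
The determinism described above can be checked with a short sketch. It assumes a snake_case variant of the function, gen_hash, with a hypothetical length parameter (neither name is in the original code), and it uses a private random.Random instance so that the module-level generator is not reseeded as a side effect.

```python
import random
import string

ALPHABET = string.ascii_letters + string.digits  # 62 output characters


def gen_hash(seed, length=10):
    """Hypothetical variant of genHash: seed a private generator, pick `length` characters."""
    rng = random.Random(seed)  # private instance; global random state is untouched
    return "".join(rng.choice(ALPHABET) for _ in range(length))


# The same seed always reproduces the same value.
first = gen_hash("DoReMi")
second = gen_hash("DoReMi")
print(first == second)  # True

# Different seeds will almost always differ, but collisions are possible
# in principle because the output space is finite (62**10 values).
print(gen_hash("DoReMi") != gen_hash("DoReMiFa"))
```

Because the output space is finite, a URL shortener built on this would probably still want to check a newly generated value against the ones already stored.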

You can change the base string to generate a different kind of hash with the same approach, still based on Python's random module.
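
As a rough illustration of swapping the base string, the sketch below takes the alphabet and the length as parameters. The function name gen_hash_with_base, its parameters, and the YOUTUBE_LIKE constant are assumptions made for this example; the 64-character alphabet (letters, digits, "-" and "_") is only meant to resemble a video ID, not to reproduce how YouTube generates them.

```python
import random
import string

# Assumed YouTube-like alphabet: letters, digits, "-" and "_" (64 characters).
YOUTUBE_LIKE = string.ascii_letters + string.digits + "-_"


def gen_hash_with_base(seed, base=YOUTUBE_LIKE, length=11):
    """Hypothetical generalisation of genHash with a configurable alphabet and length."""
    if not base:
        raise ValueError("base must contain at least one character")
    rng = random.Random(seed)  # private generator seeded with the input
    return "".join(rng.choice(base) for _ in range(length))


video_like = gen_hash_with_base("DoReMi")                        # 11 characters, 64-character alphabet
digits_only = gen_hash_with_base("DoReMi", string.digits, 6)     # 6-digit numeric code
hex_like = gen_hash_with_base("DoReMi", "0123456789abcdef", 8)   # 8 hexadecimal characters
print(video_like, digits_only, hex_like)
```

A larger alphabet or a longer output widens the output space and so lowers the chance of two seeds colliding.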

If you have a great idea for generating a YouTube-video-ID-like hash, please discuss it with me below!

10/16 Update

Actually, for string seeds random.seed() itself calls a hash function before initializing the generator. Here is the seed() method from random.py (Python 3):

random.py
class Random(_random.Random):
    VERSION = 3     # used by getstate/setstate

    def seed(self, a=None, version=2):
        """Initialize internal state from hashable object.

        None or no argument seeds from current time or from an operating
        system specific randomness source if available.

        For version 2 (the default), all of the bits are used if *a* is a str,
        bytes, or bytearray.  For version 1, the hash() of *a* is used instead.

        If *a* is an int, all bits are used.

        """

        if a is None:
            try:
                a = int.from_bytes(_urandom(32), 'big')
            except NotImplementedError:
                import time
                a = int(time.time() * 256) # use fractional seconds

        if version == 2:
            if isinstance(a, (str, bytes, bytearray)):
                if isinstance(a, str):
                    a = a.encode("utf-8")
                a += _sha512(a).digest()
                a = int.from_bytes(a, 'big')

        super().seed(a)
        self.gauss_next = None

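For a str, bytes, or bytearray seed, the SHA-512 digest of the input is appended to the input bytes, and the result is converted to one large integer that seeds the generator. The same seed therefore always yields the same random number sequence.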
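
The str branch of the snippet can be mirrored by hand as a quick check. This is a minimal sketch, assuming a CPython version whose seed() matches the code quoted above; seed_to_int is a hypothetical helper name, not part of the random module.

```python
import hashlib
import random


def seed_to_int(text):
    """Hypothetical helper mirroring the version-2 str branch of seed() quoted above."""
    data = text.encode("utf-8")
    data += hashlib.sha512(data).digest()  # append the 64-byte SHA-512 digest
    return int.from_bytes(data, "big")     # one large integer used as the seed


seed = "DoReMi"
from_str = random.Random(seed)               # seeded through the str branch
from_int = random.Random(seed_to_int(seed))  # seeded with the precomputed integer

# If seed() behaves as in the snippet, both generators yield the same sequence.
same = [from_str.random() for _ in range(3)] == [from_int.random() for _ in range(3)]
print(same)
```

Note that the original bytes are kept in front of the digest, which is why the docstring says all bits of the input are used.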

Reference:

Python Random module implementation
