I am trying to create a database for MovieLens (http://grouplens.org/datasets/movielens/). We've got movies and ratings. Movies have multiple genres, so I split those out into a separate table since it's a 1:many relationship. There's a many:many relationship as well, users to movies. I need to be able to query this table multiple ways.

So I created:

CREATE TABLE genre (
  genre_id serial NOT NULL,
  genre_name char(20) DEFAULT NULL,
  PRIMARY KEY (genre_id)
);

INSERT INTO genre VALUES 
  (1,'Action'),(2,'Adventure'),(3,'Animation'),(4,'Children''s'),(5,'Comedy'),(6,'Crime'),
  (7,'Documentary'),(8,'Drama'),(9,'Fantasy'),(10,'Film-Noir'),(11,'Horror'),(12,'Musical'),
  (13,'Mystery'),(14,'Romance'),(15,'Sci-Fi'),(16,'Thriller'),(17,'War'),(18,'Western');

CREATE TABLE movie (
  movie_id int NOT NULL DEFAULT '0',
  movie_name char(75) DEFAULT NULL,
  movie_year smallint DEFAULT NULL,
  PRIMARY KEY (movie_id)
  );

CREATE TABLE moviegenre (
  movie_id int NOT NULL DEFAULT '0',
  genre_id smallint NOT NULL DEFAULT '0',  -- PostgreSQL has no tinyint type
  PRIMARY KEY (movie_id, genre_id)
);

I don't know how to import my movies.csv, which has the columns movie_id, movie_name and movie_genre. For example, the first row is (1;Toy Story (1995);Animation|Children's|Comedy). If I INSERT manually, it should look like:

INSERT INTO moviegenre VALUES (1,3),(1,4),(1,5);

Because 3 is Animation, 4 is Children's and 5 is Comedy.

How can I import the whole data set this way?

  • How do you upload your data, with COPY? Or you edit text files?
    – user_0
    Commented May 6, 2016 at 15:13
  • I upload my data with COPY
    – KTBFFH
    Commented May 6, 2016 at 15:20

1 Answer

You should first create a table that can ingest the data from the CSV file:

CREATE TABLE movies_csv (
  movie_id    integer, 
  movie_name  varchar,
  movie_genre varchar
);
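
The staging table can then be filled with COPY, as mentioned in the comments. A minimal sketch, assuming the semicolon-delimited layout shown in the question and a hypothetical server-side path /path/to/movies.csv; whether the file has a header row is not stated, so adjust accordingly:

```sql
-- Hypothetical path; the file must be readable by the PostgreSQL server process.
COPY movies_csv (movie_id, movie_name, movie_genre)
FROM '/path/to/movies.csv'
WITH (FORMAT csv, DELIMITER ';');  -- add HEADER true if the first line holds column names
```

From psql, \copy with the same options reads the file on the client side instead.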

Note that inside SQL string literals, such as the values inserted into genre, a single quote must be doubled (Children''s); the data file itself needs no such escaping. Once the data is in this staging table you can copy the data over to the movie table, which should have the following structure:

CREATE TABLE movie (
  movie_id   integer, -- A primary key has implicit NOT NULL and should not have default
  movie_name varchar NOT NULL, -- Movie should have a name, varchar more flexible
  movie_year integer,          -- Regular integer is more efficient
  PRIMARY KEY (movie_id)
);

Sanitize your other tables likewise.

Now copy the data over, extracting the unadorned name and the year from the CSV name:

INSERT INTO movie (movie_id, movie_name, movie_year)
  SELECT movie_id, parts[1], parts[2]::integer
  FROM movies_csv, regexp_matches(movie_name, '([[:ascii:]]*)\s\(([\d]*)\)$') p(parts);

Here the regular expression says:

  • ([[:ascii:]]*) - Capture all characters until the matches below
  • \s - Read past a space
  • \( - Read past an opening parenthesis
  • ([\d]*) - Capture any digits
  • \) - Read past a closing parenthesis
  • $ - Anchor the match at the end of the string

So on input "Die Hard 17 (John lives forever) (2074)" it creates a string array with {'Die Hard 17 (John lives forever)', '2074'}. The scanning has to be from the end $, assuming all movie titles end with the year of publication in parentheses, in order to preserve parentheses and numbers in movie titles.
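
Two quick checks can be run before the INSERT. This is only a sketch using the same pattern as above: the first query tries the pattern on the sample title, and the second lists staging rows that the pattern does not match. Because regexp_matches() returns no row when there is no match, such titles would be silently left out of the movie table.

```sql
-- Try the pattern on the sample title from the explanation above.
SELECT regexp_matches('Die Hard 17 (John lives forever) (2074)',
                      '([[:ascii:]]*)\s\(([\d]*)\)$') AS parts;

-- Staging rows whose name does not fit the "Title (year)" pattern
-- and would therefore be skipped by the INSERT into movie.
SELECT movie_id, movie_name
FROM movies_csv
WHERE movie_name !~ '([[:ascii:]]*)\s\(([\d]*)\)$';
```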

Now you can work on the movie genres. You have to split the string on the bar | using the regexp_split_to_table() function and then join to the genre table on the genre name:

INSERT INTO moviegenre
  SELECT movie_id, genre_id
  FROM movies_csv, regexp_split_to_table(movie_genre, '\|') p(genre) -- escape the |
  JOIN genre ON genre.genre_name = p.genre;
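
To confirm the result, something along these lines may help. It is a sketch that assumes the table definitions from the question: the first query lists the genres stored for movie 1, and the second looks for genre names in the staging data that have no counterpart in genre (for instance because of a spelling difference) and were therefore dropped by the inner join.

```sql
-- Genres recorded for movie 1 (Toy Story in the question's example).
SELECT m.movie_name, g.genre_name
FROM movie m
JOIN moviegenre mg USING (movie_id)
JOIN genre g USING (genre_id)
WHERE m.movie_id = 1;

-- Genre names present in the CSV but missing from the genre table.
SELECT DISTINCT p.genre
FROM movies_csv
CROSS JOIN LATERAL regexp_split_to_table(movie_genre, '\|') AS p(genre)
LEFT JOIN genre g ON g.genre_name = p.genre
WHERE g.genre_id IS NULL;
```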

After all is done and dusted you can drop the movies_csv table.
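
For completeness, the cleanup is a single statement; IF EXISTS is added here only so that the statement can be re-run without an error:

```sql
-- Remove the staging table once movie and moviegenre have been populated.
DROP TABLE IF EXISTS movies_csv;
```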

  • Excellent explanation! Thank you so much!
    – KTBFFH
    Commented May 6, 2016 at 16:41
