$\begingroup$

I have three variables that I would like to combine into a dataset, but because they are in an unusual shape/format, I have had no success so far. I'm quite new to this and would really appreciate any help!

The 3 variables I have are:

print(newspaper)

['Bolero']
['Schweizer Illustrierte Style']
['Bolero']

print(title)

['Schönheit und Tragik']
['magie pur']
['Das sind unsere Favoriten']

print(pubDate)

['2007-01-01']
['2007-01-01']
['2007-01-01']

It seems to me that each variable is a list of lists, but I'm not quite sure. Since the data is scraped from a private website, I can't post the entire code here, but I hope this is enough for you to assess what the problem is with the variable format.

What I would like to have is a dataset of this format:

Newspaper                      Title                       PubDate
Bolero                         Schönheit und Tragik        2007-01-01
Schweizer Illustrierte Style   magie pur                   2007-01-01
Bolero                         Das sind unsere Favoriten   2007-01-01
$\endgroup$

1 Answer

$\begingroup$

First, you need to convert each list of lists into a single flat list.

As shown in the linked answer, you can flatten a list of lists by defining the following function.

flatten = lambda t: [item for sublist in t for item in sublist]

Now all you need to do is create a DataFrame from the flattened lists.

import pandas as pd

# Each flattened list becomes one column of the DataFrame
data = {"Newspaper": flatten(newspaper), "Title": flatten(title), "PubDate": flatten(pubDate)}
pd.DataFrame.from_dict(data)
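
For reference, here is a self-contained sketch of the same idea. The sample values are copied from the question, but the nested structure (a list of one-element lists) is an assumption, since the question only shows printed output. `itertools.chain.from_iterable` is used here as an alternative to the lambda; both produce the same result on nested lists.

```python
from itertools import chain

import pandas as pd

# Assumed structure: each variable is a list of one-element lists.
newspaper = [["Bolero"], ["Schweizer Illustrierte Style"], ["Bolero"]]
title = [["Schönheit und Tragik"], ["magie pur"], ["Das sind unsere Favoriten"]]
pubDate = [["2007-01-01"], ["2007-01-01"], ["2007-01-01"]]


def flatten(nested):
    """Concatenate the inner lists into one flat list."""
    return list(chain.from_iterable(nested))


# Build the DataFrame column by column from the flattened lists.
df = pd.DataFrame({
    "Newspaper": flatten(newspaper),
    "Title": flatten(title),
    "PubDate": flatten(pubDate),
})
print(df)
```

Note that this only works if the inputs really are nested lists. If a variable is already a flat list of strings, flattening it again will iterate over the characters of each string.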
$\endgroup$
  • $\begingroup$ I tried that, but then it separated every single letter instead of separating the 3 words. Any idea why? I get something like this: ['B', 'o', 'l', 'e', 'r', 'o'] ['S', 'c', 'h', 'w', 'e', 'i', 'z', 'e', 'r', ' ', 'I', 'l', 'l', 'u', 's', 't', 'r', 'i', 'e', 'r', 't', 'e', ' ', 'S', 't', 'y', 'l', 'e'] ['B', 'o', 'l', 'e', 'r', 'o'] $\endgroup$ Commented Feb 5, 2021 at 19:29
  • $\begingroup$ Can you check the type of the variable newspaper by executing type(newspaper)? If it does not return list, try explicitly converting it with newspaper = list(newspaper) and run the code again. Hope it helps. $\endgroup$ Commented Feb 6, 2021 at 17:07
  • $\begingroup$ type(newspaper) returns this: <class 'list'> <class 'list'> <class 'list'>, so it seems to me that the variable consists of 3 lists. I tried newspaper = list(newspaper), but the output remained the same. Any other advice? $\endgroup$ Commented Feb 7, 2021 at 1:08
  • $\begingroup$ Okay, that's weird! What is the output of print(newspaper[0]) and print(dir(newspaper))? I would suggest looking at the documentation of the scraping package you are using and checking for a function that returns the data in the required format. $\endgroup$ Commented Feb 7, 2021 at 17:29
  • $\begingroup$ I used np_elem = article_elem.find('div', class_='so_txt') to get the elements from the HTML file and newspaper = pd.Series(np_elem.text) to extract the text. The output of print(newspaper[0]) is Bolero Schweizer Illustrierte Style Bolero, and the output of print(dir(newspaper)) is something really long, starting with: ['__add__', '__class__', '__contains__', '__delattr__', '__delitem__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__iadd__', '__imul__', '__init__', '__init_subclass__', '__iter__', '__... Is this what you expected? $\endgroup$ Commented Feb 7, 2021 at 19:48
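
The output in the comments above (three separate lists printed, three separate type results) suggests, though the full code is not shown, that each variable is reassigned inside a scraping loop and only ever holds one value at a time. If that is the case, flattening is the wrong tool: a string is itself iterable, so flattening a flat list of strings yields single characters. A minimal sketch of the alternative is to collect one row per iteration and build the DataFrame once at the end. The `scraped` list below is a hypothetical stand-in for the values extracted in each loop iteration.

```python
import pandas as pd


def flatten(t):
    """Same logic as the lambda in the answer."""
    return [item for sublist in t for item in sublist]


# A str is iterable, so a flat list of strings is split into characters.
chars = flatten(["Bolero"])
print(chars)  # ['B', 'o', 'l', 'e', 'r', 'o']

# Hypothetical stand-in for the text extracted in each loop iteration.
scraped = [
    ("Bolero", "Schönheit und Tragik", "2007-01-01"),
    ("Schweizer Illustrierte Style", "magie pur", "2007-01-01"),
    ("Bolero", "Das sind unsere Favoriten", "2007-01-01"),
]

rows = []
for np_text, title_text, date_text in scraped:
    # Append one dict per article instead of overwriting the variables.
    rows.append({
        "Newspaper": np_text.strip(),
        "Title": title_text.strip(),
        "PubDate": date_text.strip(),
    })

# Build the DataFrame once, after the loop has finished.
df = pd.DataFrame(rows)
print(df)
```

In the real scraper, the three values in each tuple would come from the text of the elements found in that iteration (for example np_elem.text), appended to rows inside the existing loop.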
