2

I have a train_file.txt which has 3 columns on each row.

For example:

1 10 1

1 12 1

2 64 2

6 17 1

...

I am reading this txt file with

train_data = open("train_file.txt", 'r').readlines()

Then I am trying to get each value with for loop

for eachline in train_data:
    uid, lid, x = eachline.strip().split()

Question: The training file is huge, so I only want to read the first 1000 rows.

I tried the following code, but I get an error ('list' object cannot be interpreted as an integer):

for eachline in range(train_data,1000)
        uid, lid, x = eachline.strip().split()

5 Answers 5

6

It is not necessary to read the entire file at all. You could use enumerate on the file directly and break early or use itertools.islice:

from itertools import islice

train_data = list(islice(open("train_file.txt", 'r'), 1000))

You can also keep using the same file handle to read more data later:

f = open("train_file.txt", 'r')
train_data = list(islice(f, 1000)) # reads first 1000
test_data = list(islice(f, 100))   # reads next 100
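
To illustrate how consecutive islice calls on one handle pick up where the previous call stopped, here is a minimal sketch. It assumes an in-memory io.StringIO stand-in with made-up rows instead of the real train_file.txt, and the names first and rest are only illustrative:

```python
import io
from itertools import islice

# Stand-in for the real file: five made-up rows of "uid lid x".
f = io.StringIO("1 10 1\n1 12 1\n2 64 2\n6 17 1\n7 20 3\n")

first = list(islice(f, 3))  # consumes rows 1-3
rest = list(islice(f, 2))   # continues with rows 4-5

print(first)
print(rest)
```

Each line keeps its trailing newline, so strip() is still needed before split() if exact tokens matter.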

2

Maybe try changing this line:

train_data = open("train_file.txt", 'r').readlines()

To:

train_data = open("train_file.txt", 'r').readlines()[:1000]

2

train_data is a list, so range() cannot take it as an argument. Use slicing instead: for eachline in train_data[:1000]:

Since the file is "huge" in your words, a better approach is to read only the first 1000 rows, because readlines() loads the whole file into memory:

with open("train_file.txt", 'r') as f:
    train_data = []
    # Iterate over the file lazily; idx counts lines starting at 1
    for idx, line in enumerate(f, start=1):
        train_data.append(line.strip().split())
        if idx == 1000:
            break

Note that data will be str, not int. You probably want to convert them to int.
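
A minimal sketch of that conversion, assuming every row holds exactly three whitespace-separated integers. The helper name read_first_rows and the io.StringIO sample are hypothetical and stand in for the real file:

```python
import io
from itertools import islice

def read_first_rows(lines, limit=1000):
    """Return up to `limit` rows as (uid, lid, x) tuples of ints."""
    rows = []
    for line in islice(lines, limit):
        # split() with no argument also discards the trailing newline
        uid, lid, x = map(int, line.split())
        rows.append((uid, lid, x))
    return rows

# Made-up sample data in place of train_file.txt
sample = io.StringIO("1 10 1\n1 12 1\n2 64 2\n6 17 1\n")
rows = read_first_rows(sample, limit=3)
print(rows)
```

With the real file, the same function could be called as read_first_rows(f) inside a with open("train_file.txt") as f: block. A row with a different number of fields or a non-numeric value would raise ValueError.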

1

You could use enumerate and a break:

for k, line in enumerate(lines):
    if k >= 1000:  # k starts at 0, so this stops after 1000 lines
        break # exit the loop

    # do stuff on the line

1

I would recommend using the built-in csv module, since the data is CSV-like (or pandas, if you are already using it), and opening the file in a with block. Something like this:

import csv
from itertools import islice

with open('./test.csv', 'r') as input_file:
  csv_reader = csv.reader(input_file, delimiter=' ')
  rows = list(islice(csv_reader, 1000))

# Use rows
print(rows)

You don't need it right now but it will make escaped characters or multiline entries way easier to parse. Also, if there are headers you can use csv.DictReader to include them.
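
As a rough sketch of the csv.DictReader variant, assuming the file started with a header row (the question's file does not show one, so the header uid lid x and the in-memory sample here are hypothetical):

```python
import csv
import io
from itertools import islice

# Made-up sample with an assumed header row
data = io.StringIO("uid lid x\n1 10 1\n1 12 1\n2 64 2\n")

reader = csv.DictReader(data, delimiter=' ')
rows = list(islice(reader, 2))  # first 2 data rows; the header is consumed separately

print(rows[0]["lid"])
```

The values are still strings, as with csv.reader, so they would need an explicit int() conversion.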

Regarding your original code:

  • The call to readlines() reads every line at that point, so any filtering afterwards does not reduce the amount of data read.
  • If you did read it that way, to get the first 1000 lines your for loop should be:
for eachline in train_data[:1000]:
  ...
