Question

I need to create sparse vectors and I would like to try it using python. I have all of the data needed already to create the vectors, so my task is basically reformatting/rearranging the information that I have.

The input file I have is a 5GB file with 3 tab-separated columns, for example:

abandonment-n   about+n-the+v-know-v    1
abandonment-n   above+ns-j+vn-pass-continue-v   1
abandonment-n   after+n-the+n-a-j-stop-n    1
abandonment-n   as+n-the+ns-j-aid-n 1
cake-n  against+n-the+vg-restv  1
cake-n  as+n-a+vd-require-v 1
cake-n  as+n-a-j+vg-up-use-v    1
cake-n  as+n-the+ns-j-aid-n 2
dog-n   as+n-a-j+vg-up-use-v    7
dog-n   as+n-the+ns-j-aid-n 5

My desired output is the following

2   7
1   1   1   1
1   1   1   2
7   5

where the first line specifies the dimensions (essentially unique rows × unique cols) and the second line begins the actual matrix, in sparse format.

I think the most effective way to do this would be in python. However, as I have already calculated the corresponding weights of the data, I do not think that the vector classes in numpy, such as those found here and here, are necessary in this case. So, does anyone have any insight into how I can begin to tackle this rearranging problem in python?

The first thing that I have thought to do is open the file and split the elements into a dictionary, like this:

mydict = {}
with open("sample_outputDM_ALL_COOC", 'r') as infile_A:
    for line in infile_A:
        lemma, feat, weight = line.split()
        # store the weight under its lemma -- note that this keeps
        # only the last weight seen for each lemma
        mydict[lemma] = float(weight)
        print lemma + "\t" + weight

I have been working very hard on this problem and I still have not been able to solve it. What I have done so far is read all of the values into a dictionary, and I am able to print each individual lemma and each individual weight per row.

However, I need to have all of the weights corresponding to a given lemma in the same row. I have tried groupby, but I am not sure it is the best option for this case. I believe the solution lies in the for/if/else statements, but I can't figure out how to link the two.

Thus, the method should be along the lines of: for every unique target, print the frequencies of its slotfillers in a single row.
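Something along those lines can be done in pure Python: group the weights by lemma in a dictionary (keeping first-seen order), then emit the dimensions line followed by one tab-joined row per lemma. This is a minimal sketch, not the asker's code, and the function name is made up:

```python
def write_sparse_rows(lines):
    # lemma -> list of weights, plus first-seen order of lemmas
    rows = {}
    order = []
    feats = set()
    for line in lines:
        if not line.strip():
            continue
        lemma, feat, weight = line.split()
        if lemma not in rows:
            rows[lemma] = []
            order.append(lemma)
        rows[lemma].append(weight)
        feats.add(feat)
    # first line: dimensions (unique lemmas, unique features)
    out = ["%d\t%d" % (len(order), len(feats))]
    # then one row of weights per unique lemma
    for lemma in order:
        out.append("\t".join(rows[lemma]))
    return "\n".join(out)
```

Note that the ten sample lines actually contain three unique lemmas and seven unique features, so the dimensions line comes out as `3 7` for that data.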

Was it helpful?

Solution

Is this for homework? If not, check out the tools available in scipy.sparse or a mixture of scikits.learn and Python NLTK (e.g. this example).

Added: Based on the comment and on re-reading the question, I can also imagine using pandas.DataFrame to accomplish this, but I am not sure it will be satisfactory given the size of the data. One option would be to load the data in multiple chunks, since the problem seems parallelizable over the unique items of the first column. (See my comment below for more on that.)

def sparse_vec(df):
    return (df['Col3'].values[None,:],)

# Obviously these would be chunk-specific, and you'd need to do
# another pass to get the global sum of unique ids from Col1 and the
# global max of the number of unique rows-per-id.
n_cols = len(df.Col2.unique())
n_rows = len(df.Col1.unique())

vecs = df.groupby("Col1").apply(sparse_vec)
print vecs
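The "another pass" mentioned in the comment above could use pandas' chunked reader, so the 5GB file never has to sit in memory at once. A sketch, assuming tab-separated input and the column names used above (the helper name is made up; `chunksize` is a standard `read_csv` parameter):

```python
import pandas as pd

def sparse_dims(path_or_buf, chunksize=1000000):
    # One streaming pass over the 3-column TSV: collect the global
    # number of unique lemmas (rows) and unique features (cols).
    lemmas, feats = set(), set()
    for chunk in pd.read_csv(path_or_buf, sep="\t", header=None,
                             names=["Col1", "Col2", "Col3"],
                             chunksize=chunksize):
        lemmas.update(chunk["Col1"])
        feats.update(chunk["Col2"])
    return len(lemmas), len(feats)
```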

Using this on the sample data you gave, in IPython, I see this:

In [17]: data = """
   ....: abandonment-n   about+n-the+v-know-v    1
   ....: abandonment-n   above+ns-j+vn-pass-continue-v   1
   ....: abandonment-n   after+n-the+n-a-j-stop-n    1
   ....: abandonment-n   as+n-the+ns-j-aid-n 1
   ....: cake-n  against+n-the+vg-restv  1
   ....: cake-n  as+n-a+vd-require-v 1
   ....: cake-n  as+n-a-j+vg-up-use-v    1
   ....: cake-n  as+n-the+ns-j-aid-n 2
   ....: dog-n   as+n-a-j+vg-up-use-v    7
   ....: dog-n   as+n-the+ns-j-aid-n 5"""

In [18]: data
Out[18]: '\nabandonment-n   about+n-the+v-know-v    1\nabandonment-n   above+ns-j+vn-pass-continue-v   1\nabandonment-n   after+n-the+n-a-j-stop-n    1\nabandonment-n   as+n-the+ns-j-aid-n 1\ncake-n  against+n-the+vg-restv  1\ncake-n  as+n-a+vd-require-v 1\ncake-n  as+n-a-j+vg-up-use-v    1\ncake-n  as+n-the+ns-j-aid-n 2\ndog-n   as+n-a-j+vg-up-use-v    7\ndog-n   as+n-the+ns-j-aid-n 5'

In [19]: data.split("\n")
Out[19]:
['',
 'abandonment-n   about+n-the+v-know-v    1',
 'abandonment-n   above+ns-j+vn-pass-continue-v   1',
 'abandonment-n   after+n-the+n-a-j-stop-n    1',
 'abandonment-n   as+n-the+ns-j-aid-n 1',
 'cake-n  against+n-the+vg-restv  1',
 'cake-n  as+n-a+vd-require-v 1',
 'cake-n  as+n-a-j+vg-up-use-v    1',
 'cake-n  as+n-the+ns-j-aid-n 2',
 'dog-n   as+n-a-j+vg-up-use-v    7',
 'dog-n   as+n-the+ns-j-aid-n 5']

In [20]: data_lines = [x for x in data.split("\n") if x]

In [21]: data_lines
Out[21]:
['abandonment-n   about+n-the+v-know-v    1',
 'abandonment-n   above+ns-j+vn-pass-continue-v   1',
 'abandonment-n   after+n-the+n-a-j-stop-n    1',
 'abandonment-n   as+n-the+ns-j-aid-n 1',
 'cake-n  against+n-the+vg-restv  1',
 'cake-n  as+n-a+vd-require-v 1',
 'cake-n  as+n-a-j+vg-up-use-v    1',
 'cake-n  as+n-the+ns-j-aid-n 2',
 'dog-n   as+n-a-j+vg-up-use-v    7',
 'dog-n   as+n-the+ns-j-aid-n 5']

In [22]: split_lines = [x.split() for x in data_lines]

In [23]: split_lines
Out[23]:
[['abandonment-n', 'about+n-the+v-know-v', '1'],
 ['abandonment-n', 'above+ns-j+vn-pass-continue-v', '1'],
 ['abandonment-n', 'after+n-the+n-a-j-stop-n', '1'],
 ['abandonment-n', 'as+n-the+ns-j-aid-n', '1'],
 ['cake-n', 'against+n-the+vg-restv', '1'],
 ['cake-n', 'as+n-a+vd-require-v', '1'],
 ['cake-n', 'as+n-a-j+vg-up-use-v', '1'],
 ['cake-n', 'as+n-the+ns-j-aid-n', '2'],
 ['dog-n', 'as+n-a-j+vg-up-use-v', '7'],
 ['dog-n', 'as+n-the+ns-j-aid-n', '5']]

In [24]: df = pandas.DataFrame(split_lines, columns=["Col1", "Col2", "Col3"])

In [25]: df
Out[25]:
            Col1                           Col2 Col3
0  abandonment-n           about+n-the+v-know-v    1
1  abandonment-n  above+ns-j+vn-pass-continue-v    1
2  abandonment-n       after+n-the+n-a-j-stop-n    1
3  abandonment-n            as+n-the+ns-j-aid-n    1
4         cake-n         against+n-the+vg-restv    1
5         cake-n            as+n-a+vd-require-v    1
6         cake-n           as+n-a-j+vg-up-use-v    1
7         cake-n            as+n-the+ns-j-aid-n    2
8          dog-n           as+n-a-j+vg-up-use-v    7
9          dog-n            as+n-the+ns-j-aid-n    5

In [26]: df.groupby("Col1").apply(lambda x: (x.Col3.values[None,:],))
Out[26]:
Col1
abandonment-n    (array([[1, 1, 1, 1]], dtype=object),)
cake-n           (array([[1, 1, 1, 2]], dtype=object),)
dog-n                  (array([[7, 5]], dtype=object),)

In [27]: n_rows = len(df.Col1.unique())

In [28]: n_cols = len(df.Col2.unique())

In [29]: n_rows, n_cols
Out[29]: (3, 7)
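From there, writing out the asked-for format is one more groupby over the DataFrame. A sketch (the function name is made up; `sort=False` keeps the lemmas in first-seen order rather than sorting them alphabetically):

```python
import pandas as pd

def dump_sparse(df, out_path):
    # dimensions line, then one tab-joined row of Col3 weights per
    # unique Col1 value, in order of first appearance
    grouped = df.groupby("Col1", sort=False)["Col3"].apply(list)
    with open(out_path, "w") as out:
        out.write("%d\t%d\n" % (df["Col1"].nunique(),
                                df["Col2"].nunique()))
        for weights in grouped:
            out.write("\t".join(str(w) for w in weights) + "\n")
```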
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow