- 16 Jul, 2019 2 commits
-
-
serpucga authored
-
serpucga authored
Before, it was done the same way as in UTool: increasing an entry in the metadata file by X each time X tweets are added to that CSV. However, for a script that converts static Mongo collections (which are not growing in size) to CSV, it is better to simply count the number of lines of each CSV file once the conversion process has ended. This also suppresses the risk of the metadata being corrupted due to bad parallelization.
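A minimal sketch of the counting approach described above: walk the output directory after the conversion has finished and count the rows of each CSV once. The directory name, metadata file name, and metadata layout here are assumptions for illustration, not the script's actual ones.

```python
import os
import json

def count_csv_lines(path: str) -> int:
    """Return the number of data rows in a CSV file (header excluded)."""
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for _ in f) - 1  # subtract the header line

def build_metadata(output_dir: str = "pymongoexport_csv") -> dict:
    """Build the per-file tweet counts in a single pass over the output tree."""
    metadata = {"files": {}}
    for root, _, files in os.walk(output_dir):
        for name in files:
            if name.endswith(".csv"):
                file_path = os.path.join(root, name)
                metadata["files"][file_path] = {"count": count_csv_lines(file_path)}
    return metadata

if __name__ == "__main__":
    # Hypothetical metadata file name; written once, after all workers are done,
    # so no concurrent updates can corrupt it.
    with open(".metadata.json", "w", encoding="utf-8") as f:
        json.dump(build_metadata(), f, indent=2)
```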
-
- 15 Jul, 2019 5 commits
-
-
serpucga authored
The previous version ran out of memory for big databases because it tried to launch all processes at once. This version has no memory issues anymore, but the problems with thread safety and process collisions remain.
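A rough sketch of how the memory problem can be avoided: instead of spawning one process per page up front, hand the pages to a fixed-size pool so only a bounded number of workers exists at any time. The `dump_page` function and the page size are placeholders, not the script's real code or configuration.

```python
import multiprocessing as mp

PAGE_SIZE = 1000  # assumed number of tweets per page

def dump_page(page_index: int) -> None:
    # Placeholder: fetch the tweets of this page from Mongo and append them
    # to the corresponding CSV files.
    pass

def convert(num_tweets: int) -> None:
    num_pages = (num_tweets + PAGE_SIZE - 1) // PAGE_SIZE
    # The pool keeps at most cpu_count() worker processes alive at once,
    # so memory use no longer grows with the number of pages.
    with mp.Pool(processes=mp.cpu_count()) as pool:
        pool.map(dump_page, range(num_pages))
```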
-
serpucga authored
-
serpucga authored
-
serpucga authored
Parallelized using the multiprocessing library. I'm not really sure the code is thread safe. I think we don't care if tweets are appended to the files in a different order, but the metadata files being corrupted would be problematic. In the first tests the metadata were fine, but I think this line is probably not thread safe (two workers could load the old value and try to update it at the same time, resulting in inconsistencies): """ metadata_file["files"][file_path]["count"] += increase """ Apart from that, the code is much faster than before.
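One possible way to make that increment safe (not taken from the repository, and later made unnecessary by counting the CSV lines at the end) is to serialize the whole read-modify-write with a lock shared by all workers. The metadata path and layout are assumptions for illustration.

```python
import json
import multiprocessing as mp

# The lock must be created in the parent process and inherited by
# (or explicitly passed to) the workers.
metadata_lock = mp.Lock()

def update_count(metadata_path: str, file_path: str, increase: int) -> None:
    # Without the lock, two workers could both read the old count and then
    # write back count + increase, silently losing one of the updates.
    with metadata_lock:
        with open(metadata_path, "r", encoding="utf-8") as f:
            metadata = json.load(f)
        metadata["files"][file_path]["count"] += increase
        with open(metadata_path, "w", encoding="utf-8") as f:
            json.dump(metadata, f)
```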
-
serpucga authored
Simpler, more elegant, and slightly faster version that iterates over the cursors instead of building a list of tweets for each page.
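A minimal illustration of the change described above, using pymongo; the database and collection names, page values, and the write helper are placeholders. Iterating the cursor streams documents one by one instead of materialising each page as a Python list first.

```python
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
collection = client["twitter_db"]["tweets"]
offset, page_size = 0, 1000

def write_tweet_to_csv(tweet: dict) -> None:
    pass  # placeholder for the CSV-writing logic

# Before (sketch): build an intermediate list for the page.
# page = list(collection.find().skip(offset).limit(page_size))
# for tweet in page:
#     write_tweet_to_csv(tweet)

# After (sketch): consume the cursor directly, with no intermediate list.
for tweet in collection.find().skip(offset).limit(page_size):
    write_tweet_to_csv(tweet)
```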
-
- 12 Jul, 2019 3 commits
- 11 Jul, 2019 2 commits
- 10 Jul, 2019 2 commits
-
-
serpucga authored
-
serpucga authored
Repository contains just one simple script for the moment, which dumps the "tweets" collection of a Mongo database to a JSON file in a "pymongodump" directory created at the time and place of execution. Faster than mongoexport, although the format of the resulting JSON files is somewhat different (adapted to Python's syntax).
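A rough sketch of what such a dump script can look like; the connection details, database name, and file naming are assumptions. Writing each document with str() is one way to get the "Python syntax" format mentioned above (single quotes, True/False, ObjectId(...)), which differs from mongoexport's strict JSON.

```python
import os
from pymongo import MongoClient

OUTPUT_DIR = "pymongodump"  # created where the script is executed

def dump_tweets(host: str = "localhost", port: int = 27017,
                database: str = "twitter_db") -> None:
    """Dump the 'tweets' collection of the given database to a text file."""
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    collection = MongoClient(host, port)[database]["tweets"]
    out_path = os.path.join(OUTPUT_DIR, database + ".json")
    with open(out_path, "w", encoding="utf-8") as f:
        for tweet in collection.find():
            # str() renders the document as a Python literal rather than strict JSON.
            f.write(str(tweet) + "\n")

if __name__ == "__main__":
    dump_tweets()
```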
-