1. 17 Jul, 2019 2 commits
  2. 16 Jul, 2019 3 commits
    • Trying to make it thread safe, step1 · 866da8f7
      serpucga authored
    • Reformatting · 844fabe9
      serpucga authored
    • Changed way of generating metadata file · 2e50b803
      serpucga authored
      Before, this was done the same way as in UTool: an entry in the
      metadata file was increased by X each time that X tweets were added
      to the corresponding CSV. However, for a script that converts static
      Mongo collections (which are not growing in size) to CSV, it is
      better to simply count the lines of each CSV file once the
      conversion process has ended.
      
      This also removes the risk of the metadata being corrupted by bad
      parallelization.
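      
      A minimal sketch of that post-hoc approach, run once after the
      conversion has finished (the output directory name and the metadata
      layout are assumptions for illustration, not taken from the
      repository):
      
      """
      import json
      import os
      
      def count_csv_lines(path):
          # Count data rows, excluding the header line.
          with open(path, newline="", encoding="utf-8") as f:
              return max(sum(1 for _ in f) - 1, 0)
      
      def write_metadata(output_dir):
          # Walk the finished output tree once and record the final row
          # count of every CSV file, instead of incrementing counts while
          # the dump is still running.
          metadata = {"files": {}}
          for root, _dirs, files in os.walk(output_dir):
              for name in files:
                  if name.endswith(".csv"):
                      path = os.path.join(root, name)
                      metadata["files"][path] = {"count": count_csv_lines(path)}
          with open(os.path.join(output_dir, ".metadata.json"), "w") as f:
              json.dump(metadata, f, indent=2)
      
      write_metadata("pymongodump")
      """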
  3. 15 Jul, 2019 5 commits
    • Limited number of processes launched by using Pool · 1f695bf1
      serpucga authored
      The previous version ran out of memory on big databases because it
      tried to launch all processes at once. This version no longer has
      memory issues.
      
      Problems with thread safety and process collisions remain.
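      
      A minimal sketch of how a multiprocessing.Pool bounds the number of
      live worker processes (the page-based work function is hypothetical):
      
      """
      import multiprocessing as mp
      
      def dump_page(page_index):
          # Placeholder for the real work: convert one page of tweets to CSV.
          return page_index
      
      if __name__ == "__main__":
          # At most cpu_count() worker processes are alive at any moment,
          # so memory stays bounded however many pages the database has.
          with mp.Pool(processes=mp.cpu_count()) as pool:
              pool.map(dump_page, range(1000))
      """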
    • Ignored tests.py · 84849517
      serpucga authored
    • Create one Mongo connection for each process · 33782e37
      serpucga authored
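      
      A common way to get one connection per process is a Pool initializer
      that opens a MongoClient inside each child process. The sketch below
      assumes pymongo and hypothetical database/collection names; it is
      not necessarily how the repository does it:
      
      """
      import multiprocessing as mp
      from pymongo import MongoClient
      
      worker_db = None
      
      def init_worker():
          # Runs once inside each worker process, so every process opens
          # its own MongoClient instead of sharing the parent's sockets.
          global worker_db
          worker_db = MongoClient("localhost", 27017)["database"]
      
      def dump_page(page_index):
          # Placeholder: each worker reads tweets through its own client.
          return worker_db["tweets"].estimated_document_count()
      
      if __name__ == "__main__":
          with mp.Pool(processes=4, initializer=init_worker) as pool:
              print(pool.map(dump_page, range(8)))
      """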
    • First parallel version of the code · ed2c9d74
      serpucga authored
      Parallelized using the multiprocessing library. I'm not really sure
      the code is thread safe. I think we don't care if tweets are
      appended to the files in a different order, but the metadata files
      being corrupted would be problematic. In the first tests the
      metadata were fine, but I think this line is probably not thread
      safe (two processes could load and try to update the old value at
      the same time, resulting in inconsistencies):
      
      """
      metadata_file["files"][file_path]["count"] += increase
      """
      
      Apart from that, the code is much faster than before.
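      
      The read-modify-write on that line is indeed not atomic. One way to
      make it safe is to serialize the whole update with a
      multiprocessing.Lock; the sketch below uses hypothetical names and
      assumes the metadata lives in a JSON file on disk:
      
      """
      import json
      import multiprocessing as mp
      
      # Created before the workers start (or handed to them through a
      # Pool initializer) so that all processes share the same lock.
      metadata_lock = mp.Lock()
      
      def add_to_count(metadata_path, file_path, increase):
          # The lock makes the load / increment / store sequence atomic
          # across processes, so concurrent updates cannot overwrite each
          # other's increments.
          with metadata_lock:
              with open(metadata_path) as f:
                  metadata_file = json.load(f)
              metadata_file["files"][file_path]["count"] += increase
              with open(metadata_path, "w") as f:
                  json.dump(metadata_file, f)
      """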
    • Simpler, more elegant and slightly faster version using the cursors instead of… · 34776b63
      serpucga authored
      Simpler, more elegant and slightly faster version using the cursors instead of building a list of tweets for each page
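      
      The change amounts to iterating the pymongo cursor lazily instead of
      materialising each page as a list first; a rough sketch under that
      assumption (database and field names are hypothetical):
      
      """
      from pymongo import MongoClient
      
      collection = MongoClient()["database"]["tweets"]
      page_size = 1000
      
      # Before: build the whole page in memory, then loop over the list.
      # page = list(collection.find().skip(0).limit(page_size))
      
      # After: iterate the cursor directly and let it stream documents.
      for tweet in collection.find().skip(0).limit(page_size):
          print(tweet.get("id_str"))
      """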
  4. 12 Jul, 2019 3 commits
  5. 11 Jul, 2019 2 commits
  6. 10 Jul, 2019 2 commits
    • gitignore set to ignore output dir pymongodump · 56c27157
      serpucga authored
    • Initial commit: Mongo to JSON dumper · d1923e7e
      serpucga authored
      For the moment the repository contains just one simple script, which
      dumps the "tweets" collection of a Mongo database to a JSON file
      inside a "pymongodump" directory created at the time and place of
      execution.
      Faster than mongoexport, although the format of the resulting JSON
      files is somewhat different (adapted to Python's syntax).
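      
      A minimal sketch of such a dump script (the database name is
      hypothetical, and writing each document's Python repr is an assumed
      detail that would explain the "adapted to Python's syntax" format,
      not a confirmed one):
      
      """
      import os
      from pymongo import MongoClient
      
      collection = MongoClient()["database"]["tweets"]
      
      os.makedirs("pymongodump", exist_ok=True)
      with open(os.path.join("pymongodump", "tweets.json"), "w") as f:
          for tweet in collection.find():
              # Writing the documents' Python repr is fast, but produces
              # single quotes, True/None and ObjectId(...) rather than
              # strict JSON, hence the difference from mongoexport output.
              f.write(str(tweet) + "\n")
      """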