1. 22 Jul, 2019 1 commit
    • Include corrupt page number in recovery file · c05a677c
      serpucga authored
      Added a way of dumping the number of the page that raised an error
      when converting to CSV. This way we can later implement ways of
      dealing with it (skipping the corrupt page, inspecting it to see
      where the problem lies, etc.), as sketched below.
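      A minimal sketch of the idea, assuming a JSON recovery file and a
      caller-supplied conversion routine (the names and paths here are
      illustrative, not the actual implementation):
      
      """
      import json
      
      def convert_page_safe(convert, page_number, tweets, recovery_path="recovery.json"):
          # 'convert' stands for whatever routine turns a page of tweets
          # into CSV; on failure we record the page number so a later run
          # can skip or inspect the corrupt page.
          try:
              convert(tweets)
          except Exception:
              try:
                  with open(recovery_path) as f:
                      recovery = json.load(f)
              except FileNotFoundError:
                  recovery = {"corrupt_pages": []}
              recovery["corrupt_pages"].append(page_number)
              with open(recovery_path, "w") as f:
                  json.dump(recovery, f)
      """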
  2. 19 Jul, 2019 4 commits
  3. 18 Jul, 2019 4 commits
  4. 17 Jul, 2019 3 commits
    • Thread-safe parallel version (dirty code) · 34170f35
      serpucga authored
      Changed the structure of the code to make it thread safe when dumping
      data to the filesystem. The previous parallelism affected all the
      stages, which could lead to corrupt data when two processes tried to
      write to the same file at the same time. Now the code that retrieves
      data from Mongo and converts it to CSV, named "process_page" because
      each worker receives a page of X (default 1000) tweets to convert, is
      parallelized and handed to a pool of workers.
      
      However, those workers only write to buffers that they push onto a
      thread-safe multiprocessing queue. That queue is consumed by a single
      process, the "filesystem_writer", which is the only one allowed to
      write to the filesystem (this covers both creating the necessary
      directories and appending tweets to the CSV files). This worker runs
      in an endless loop, watching the queue for new data to write out. A
      sketch of this layout is included below.
      
      This is a fairly dirty version that still includes functions and code
      that are no longer used, as well as rough log messages added during
      development to hunt down bugs.
      
      Will refactor soon.
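      A minimal sketch of the pool/queue/writer layout described above. The
      field names, output path, and in-memory pages are stand-ins for
      illustration; the real process_page pulls its page from Mongo:
      
      """
      import csv
      import multiprocessing as mp
      import os
      
      def process_page(args):
          # Worker: convert one page of tweets into CSV rows and push the
          # buffer onto the queue; it never touches the filesystem itself.
          page, queue = args
          rows = [[tweet.get("id"), tweet.get("text", "")] for tweet in page]
          queue.put(("pymongodump/tweets.csv", rows))
      
      def filesystem_writer(queue):
          # Single process draining the queue; the only one allowed to create
          # directories and append to the CSV files.
          while True:
              item = queue.get()
              if item is None:          # sentinel: all pages processed
                  break
              path, rows = item
              os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
              with open(path, "a", newline="") as f:
                  csv.writer(f).writerows(rows)
      
      if __name__ == "__main__":
          pages = [[{"id": i, "text": "hello"}] for i in range(3)]  # stand-in pages
          manager = mp.Manager()
          queue = manager.Queue()
          writer = mp.Process(target=filesystem_writer, args=(queue,))
          writer.start()
          with mp.Pool(4) as pool:
              pool.map(process_page, [(page, queue) for page in pages])
          queue.put(None)
          writer.join()
      """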
    • Forgot to add main module logger · 50dda137
      serpucga authored
    • Added proper logging · 7ff9ad26
      serpucga authored
  5. 16 Jul, 2019 3 commits
    • Trying to make it thread safe, step1 · 866da8f7
      serpucga authored
    • Reformatting · 844fabe9
      serpucga authored
    • Changed way of generating metadata file · 2e50b803
      serpucga authored
      Before, it was done in the same way as in UTool: by increasing an
      entry in the metadata file by X each time that X tweets are added to
      that CSV. However, for a script that converts static Mongo collections
      (which do not grow in size) to CSV, it is better to simply count the
      number of lines of each CSV file once the conversion process has
      ended, as sketched below.
      
      This also removes the risk of the metadata being corrupted due to bad
      parallelization.
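      A minimal sketch of counting at the end, assuming the CSVs live under
      the pymongodump output directory and the metadata keeps the same
      "count" field used before:
      
      """
      import os
      
      def generate_metadata(output_dir="pymongodump"):
          # Walk the finished output tree once and count lines per CSV,
          # instead of bumping a counter on every append during conversion.
          metadata = {"files": {}}
          for root, _, files in os.walk(output_dir):
              for name in files:
                  if name.endswith(".csv"):
                      path = os.path.join(root, name)
                      with open(path) as f:
                          metadata["files"][path] = {"count": sum(1 for _ in f)}
          return metadata
      """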
  6. 15 Jul, 2019 5 commits
    • Limited number of processes launched by using Pool · 1f695bf1
      serpucga authored
      The previous version ran out of memory on big databases because it
      tried to launch all the processes at once. This version no longer has
      memory issues.
      
      Problems with thread safety and process collisions remain.
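      A minimal sketch of the difference, with a dummy per-page task
      standing in for the real Mongo-to-CSV conversion:
      
      """
      import multiprocessing as mp
      
      def process_page(page_number):
          # Stand-in for the real per-page conversion work.
          return page_number
      
      if __name__ == "__main__":
          page_numbers = range(100_000)
          # A Pool keeps only a fixed number of worker processes alive and
          # feeds them pages one by one, instead of spawning one Process per
          # page and exhausting memory.
          with mp.Pool(processes=mp.cpu_count()) as pool:
              for _ in pool.imap_unordered(process_page, page_numbers):
                  pass
      """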
    • Ignored tests.py · 84849517
      serpucga authored
    • Create one Mongo connection for each process · 33782e37
      serpucga authored
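      One way to do this is a Pool initializer that opens its own
      MongoClient in every worker, since pymongo clients are not fork-safe
      and cannot simply be shared. The URI, database name, and paging
      scheme below are assumptions for illustration, not the repository's
      actual code:
      
      """
      import multiprocessing as mp
      from pymongo import MongoClient
      
      collection = None
      
      def init_worker(mongo_uri, db_name):
          # Runs once in each worker process: open a connection owned by
          # that process only.
          global collection
          collection = MongoClient(mongo_uri)[db_name]["tweets"]
      
      def process_page(page_number, page_size=1000):
          cursor = collection.find().skip(page_number * page_size).limit(page_size)
          return sum(1 for _ in cursor)   # placeholder for the real conversion
      
      if __name__ == "__main__":
          with mp.Pool(4, initializer=init_worker,
                       initargs=("mongodb://localhost:27017", "twitter_db")) as pool:
              pool.map(process_page, range(10))
      """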
    • First parallel version of the code · ed2c9d74
      serpucga authored
      Parallelized using the multiprocessing library. I'm not really sure
      the code is thread safe. I think we don't care if tweets are appended
      to the files in a different order, but corrupted metadata files would
      be problematic. In the first tests the metadata were fine, but this
      line is probably not thread safe (two processes could load and try to
      update the old value at the same time, resulting in inconsistencies):
      
      """
      metadata_file["files"][file_path]["count"] += increase
      """
      
      Apart from that, the code is much faster than before.
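      One way such a read-modify-write could be serialized is a lock
      created in the parent and handed to every worker (the paths and
      plumbing here are assumptions); the commits above eventually avoid
      the problem altogether by funnelling all writes through a single
      process and counting lines at the end:
      
      """
      import json
      import multiprocessing as mp
      
      def increase_count(lock, metadata_path, file_path, increase):
          # The lock makes sure two processes cannot both read the old value
          # and then overwrite each other's update.
          with lock:
              with open(metadata_path) as f:
                  metadata_file = json.load(f)
              metadata_file["files"][file_path]["count"] += increase
              with open(metadata_path, "w") as f:
                  json.dump(metadata_file, f)
      
      if __name__ == "__main__":
          lock = mp.Lock()   # created once here and passed to every worker
          with open("metadata.json", "w") as f:
              json.dump({"files": {"out.csv": {"count": 0}}}, f)
          increase_count(lock, "metadata.json", "out.csv", 1000)
      """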
    • Simpler, more elegant and slightly faster version using the cursors instead of… · 34776b63
      serpucga authored
      Simpler, more elegant and slightly faster version using the cursors instead of building a list of tweets for each page
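      A minimal sketch of the difference, with an assumed URI, database
      name, and paging scheme (the real script's parameters may differ):
      
      """
      from pymongo import MongoClient
      
      def page_tweets(collection, page_number, page_size=1000):
          # Iterate the cursor as it streams results instead of materialising
          # the whole page as a list of tweets first.
          cursor = collection.find().skip(page_number * page_size).limit(page_size)
          for tweet in cursor:
              yield tweet   # each tweet is converted/written as it arrives
      
      if __name__ == "__main__":
          tweets = MongoClient("mongodb://localhost:27017")["twitter_db"]["tweets"]
          for tweet in page_tweets(tweets, page_number=0):
              print(tweet["_id"])
      """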
  7. 12 Jul, 2019 3 commits
  8. 11 Jul, 2019 2 commits
  9. 10 Jul, 2019 2 commits
    • gitignore set to ignore output dir pymongodump · 56c27157
      serpucga authored
    • Initial commit: Mongo to JSON dumper · d1923e7e
      serpucga authored
      For the moment the repository contains just one simple script, which
      dumps the "tweets" collection of a Mongo database to a JSON file
      inside a "pymongodump" directory created at the time and place of
      execution.
      It is faster than mongoexport, although the format of the resulting
      JSON files is somewhat different (adapted to Python's syntax).
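      A minimal sketch of such a dump, assuming a local Mongo instance and
      an invented database name; the original script's exact serialization
      is not shown here, so json.dumps with default=str stands in for it:
      
      """
      import json
      import os
      from pymongo import MongoClient
      
      client = MongoClient("mongodb://localhost:27017")      # assumed URI
      collection = client["twitter_db"]["tweets"]            # assumed DB name
      
      os.makedirs("pymongodump", exist_ok=True)
      with open(os.path.join("pymongodump", "tweets.json"), "w") as f:
          for document in collection.find():
              # default=str turns non-JSON types (ObjectId, datetime) into
              # strings; one document is written per line.
              f.write(json.dumps(document, default=str) + "\n")
      """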