
1.4 Billion Text Credentials Analysis (NLP)

Using deep learning and NLP to analyze a large corpus of clear text passwords.


Disclaimer: for research purposes only.

In the press

Get the data

  • Download any Torrent client.
  • Here is a magnet link you can find on Reddit:
    • magnet:?xt=urn:btih:7ffbcd8cee06aba2ce6561688cf68ce2addca0a3&dn=BreachCompilation&

Deep Learning

  • Stay tuned!

Map the password list for each email

Generate JSON files that map each email to its list of passwords. The output folder is ~/BreachCompilationAnalysis.

python3 --breach_compilation_folder ~/BreachCompilation
  • Make sure you have enough free memory (8GB should be enough).
  • It took 1h30m to run on an Intel(R) Core(TM) i7-6900K CPU @ 3.20GHz (on a single thread).
  • Uncompressed output is 13G.
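
The mapping step above can be illustrated with a minimal sketch. It is not the repository's actual script. It assumes that each input line has the form `email:password` or `email;password`, which is an assumption about the dump's layout, and the function name `group_passwords_by_email` is hypothetical. The sketch holds everything in memory, so it only suits small samples.

```python
import json
from collections import defaultdict


def group_passwords_by_email(lines, separators=(":", ";")):
    """Group passwords by email from 'email<sep>password' lines.

    Assumption: the first ':' or ';' on a line separates the email
    from the password. Lines without a separator are skipped.
    """
    grouped = defaultdict(list)
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Find the earliest separator so passwords may contain ':' or ';'.
        positions = [line.find(s) for s in separators if s in line]
        if not positions:
            continue  # malformed line: no separator at all
        cut = min(positions)
        email = line[:cut].strip().lower()
        password = line[cut + 1:]
        if not email or not password:
            continue  # skip entries with an empty email or password
        # Keep first-seen order and drop exact duplicate passwords.
        if password not in grouped[email]:
            grouped[email].append(password)
    return dict(grouped)


if __name__ == "__main__":
    # Made-up sample lines, for illustration only.
    sample = ["a@x.com:pw1", "A@x.com;pw2", "a@x.com:pw1", "broken"]
    print(json.dumps(group_passwords_by_email(sample), indent=2))
```

In this sketch, emails are lowercased so that `A@x.com` and `a@x.com` share one entry. Whether the real script normalizes emails this way is not stated here.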

Output is of the form:

> less ReducePasswordsOnSimilarEmailsCallback-z-b.json # emails starting with zb.
    "[email protected]": [
    "[email protected]": [