1.4 Billion Text Credentials Analysis (NLP)

Using deep learning and NLP to analyze a large corpus of cleartext passwords.

Objectives:

Disclaimer: for research purposes only.

In the press

Get the data

  • Install any torrent client.
  • Here is a magnet link you can find on Reddit:
    • magnet:?xt=urn:btih:7ffbcd8cee06aba2ce6561688cf68ce2addca0a3&dn=BreachCompilation&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80&tr=udp%3A%2F%2Ftracker.leechers-paradise.org%3A6969&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969&tr=udp%3A%2F%2Fglotorrents.pw%3A6969&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337

Deep Learning

  • Stay tuned!

Map the password list for each email

Generate the JSON files that map each email to its list of passwords. The output folder is ~/BreachCompilationAnalysis.

python3 read.py --breach_compilation_folder ~/BreachCompilation
  • Make sure you have enough free memory (8GB should be enough).
  • It took 1h30m to run on an Intel(R) Core(TM) i7-6900K CPU @ 3.20GHz (on a single thread).
  • The uncompressed output is 13 GB.
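
The mapping step can be pictured as grouping credential lines by email. The sketch below is not the code in read.py; it is a minimal illustration that assumes each input line has the form email:password or email;password, and the function name group_passwords_by_email is hypothetical.

```python
import re
from collections import defaultdict

# Assumption: each line looks like "email:password" or "email;password".
_SEPARATOR = re.compile(r"[:;]")


def group_passwords_by_email(lines):
    """Return a dict mapping each email to its list of distinct passwords."""
    grouped = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        # Split on the first separator only, so passwords may contain ':' or ';'.
        parts = _SEPARATOR.split(line, maxsplit=1)
        if len(parts) != 2:
            continue  # no separator found: skip the line
        email, password = parts
        email = email.strip().lower()
        if not email or not password:
            continue
        if password not in grouped[email]:
            grouped[email].append(password)
    return dict(grouped)


# Made-up sample lines, for illustration only.
sample = [
    "alice@example.com:pass1",
    "alice@example.com;pass2",
    "bob@example.com:hunter2",
    "malformed line",
]
mapping = group_passwords_by_email(sample)
```

Holding the whole mapping in memory like this is simple but costly, which may be one reason the real run needs several gigabytes of free memory.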

Output is of the form:

> less ReducePasswordsOnSimilarEmailsCallback-z-b.json # emails starting with zb.
{
    "[email protected]": [
        "pass1",
        "pass2"
    ],
    "[email protected]": [
        "pass1",
        "pass2",
        "pass3"
    ],
    [...]
}
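
Since each output file is a plain JSON object, it can be inspected with Python's standard json module. The sketch below assumes the structure shown above (email keys, lists of passwords as values); the helper names load_mapping and most_common_passwords are hypothetical, and the sample data is made up.

```python
import json
from collections import Counter


def load_mapping(path):
    """Load one output file into a dict of email -> list of passwords."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def most_common_passwords(mapping, top_n=3):
    """Count how many emails use each password and return the top_n."""
    counts = Counter()
    for passwords in mapping.values():
        # Use a set so a password is counted once per email.
        counts.update(set(passwords))
    return counts.most_common(top_n)


# Inline sample instead of a real file, for illustration only.
sample = json.loads(
    '{"a@example.com": ["pass1", "pass2"], "b@example.com": ["pass1", "pass3"]}'
)
top = most_common_passwords(sample, top_n=1)
```

With a real file, load_mapping would be called on a path inside ~/BreachCompilationAnalysis, for example the ReducePasswordsOnSimilarEmailsCallback-z-b.json file shown above.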