
Hadoop, PHP & Symfony - SF Live San Francisco 2017

Michael C.
October 19, 2017


Transcript

  1. @MICHAELCULLUMUK HADOOP
    2003: Google File System
    2004: MapReduce: Simplified Data Processing on Large Clusters
    2006: Hadoop development begins (and is named after a toy elephant)
  2. HADOOP
    2006: 1.8 TB in 47.9 hours
    2007: 1 PB in 12.13 hours, 1.37 TB/min
    2008: 1 PB in 6.03 hours, 2.76 TB/min
    2010: 1 PB in 2.95 hours, 5.65 TB/min
    2011: 1 PB in 0.55 hours, 30.3 TB/min
    2012: 50 PB in 23 hours, 36.2 TB/min
    2014: Spark allowed; sorts 5 times as fast on 10 times fewer nodes
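
The TB/min figures on this slide follow from the data size and the elapsed time. A minimal sketch of that arithmetic, assuming decimal units (1 PB = 1000 TB); the function name `sortThroughputTbPerMin` is made up for illustration:

```php
<?php
declare(strict_types=1);

/**
 * Sort throughput in TB/min, assuming decimal units (1 PB = 1000 TB).
 * Hypothetical helper, not part of the talk.
 */
function sortThroughputTbPerMin(float $terabytes, float $hours): float
{
    return $terabytes / ($hours * 60.0);
}

// Rows taken from the slide: [year, data sorted in TB, elapsed hours].
$runs = [
    [2007, 1000.0, 12.13],
    [2008, 1000.0, 6.03],
    [2010, 1000.0, 2.95],
    [2011, 1000.0, 0.55],
    [2012, 50000.0, 23.0],
];

foreach ($runs as [$year, $tb, $hours]) {
    printf("%d: %.2f TB/min\n", $year, sortThroughputTbPerMin($tb, $hours));
}
```

The printed rates should agree with the slide's figures up to rounding.
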
  3. OTHER TOOLS
    ▸ Hive - Relational style database
    ▸ Bigtop - Quickly set up a test cluster
    ▸ Pig - High level programming language for MapReduce jobs
    ▸ Sqoop - For importing/reading MySQL and other RDBMS
    ▸ Spark - Alternative to MapReduce designed for fast analytics
    ▸ Flume - Streaming data collection / aggregation manager
    ▸ Oozie - MapReduce Workflow Manager & Scheduler
    ▸ Whirr - Deployment of clusters to AWS
    ▸ HBase - Low-latency distributed, non-relational database
    ▸ Zookeeper - Distributed application HA management
    ▸ HCatalog - Interop between Pig and Hive
  4. CONTENTS
    ▸ Hadoop HDFS: How data is stored, accessed and distributed under the hood
    ▸ Hive: Using Hadoop as an RDBMS; writing
    ▸ Presto: A Facebook library we can use to query the cluster; reading
    ▸ Phresto: A library to make Presto accessible from PHP userland
    ▸ Phresto & Doctrine
  5. READING A FILE FROM HDFS
    Namenode: block locations for file.txt
        Block 1: 192.168.0.2, 192.168.0.5
        Block 2: 192.168.0.3, 192.168.0.12
    Client library -> Datanode 192.168.0.5: read block 1, returns {content}
    Client library -> Datanode 192.168.0.3: read block 2, returns {content}
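
The read path on this slide can be sketched in a few lines of PHP. This is an in-memory simulation and not a real HDFS client: the `$namenode` and `$datanodes` arrays, the block contents and the `readFromHdfs` function are all invented for illustration. It only shows the division of labour, where the namenode hands out block locations and the datanodes serve the bytes.

```php
<?php
declare(strict_types=1);

// Hypothetical namenode metadata: file => block number => datanodes holding a replica.
$namenode = [
    'file.txt' => [
        1 => ['192.168.0.2', '192.168.0.5'],
        2 => ['192.168.0.3', '192.168.0.12'],
    ],
];

// Hypothetical datanode storage: address => "file#block" => bytes.
// 192.168.0.2 and 192.168.0.12 are left out to stand in for unreachable nodes.
$datanodes = [
    '192.168.0.5' => ['file.txt#1' => 'hello '],
    '192.168.0.3' => ['file.txt#2' => 'world'],
];

function readFromHdfs(array $namenode, array $datanodes, string $file): string
{
    if (!isset($namenode[$file])) {
        throw new RuntimeException("Unknown file: $file");
    }

    $content = '';
    // Step 1: the namenode only supplies block locations, never file data.
    foreach ($namenode[$file] as $block => $replicas) {
        $key = "$file#$block";
        $found = false;
        // Step 2: try each replica in turn until one serves the block.
        foreach ($replicas as $address) {
            if (isset($datanodes[$address][$key])) {
                $content .= $datanodes[$address][$key];
                $found = true;
                break;
            }
        }
        if (!$found) {
            throw new RuntimeException("No reachable replica for block $block");
        }
    }

    return $content;
}

echo readFromHdfs($namenode, $datanodes, 'file.txt'), "\n";
```

As on the slide, block 1 ends up being read from 192.168.0.5 and block 2 from 192.168.0.3, and the client library stitches the blocks back together in order.
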
  6. CONTENTS
    ▸ ✓ Hadoop HDFS: How data is stored, accessed and distributed under the hood
    ▸ Hive: Using Hadoop as an RDBMS; importing files
    ▸ Presto: A Facebook library we can use to query the cluster
    ▸ Phresto: A library to make Presto accessible from PHP userland
    ▸ Phresto & Doctrine
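  7. IMPORTING TO HIVE IS EASY
    # Reads one filename per line from stdin
    while read -r filename; do
        echo "$filename"
        # Copy the local file into HDFS
        hadoop fs -put "/home/michael/for_hadoop_import/$filename" /tmp/
        # Load into the CSV staging table, copy into the ORC table, then empty the staging table
        echo "LOAD DATA INPATH '/tmp/$filename' INTO TABLE temp_csv;
              INSERT INTO TABLE temp_orc SELECT * FROM temp_csv;
              TRUNCATE TABLE temp_csv;" | hive
    done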
  8. CONTENTS
    ▸ ✓ Hadoop HDFS: How data is stored, accessed and distributed under the hood
    ▸ ✓ Hive: Using Hadoop as an RDBMS; importing files
    ▸ Presto: A Facebook library we can use to query the cluster
    ▸ Phresto: A library to make Presto accessible from PHP userland
    ▸ Phresto & Doctrine
  9. CONNECTORS
    ▸ Hive
    ▸ MySQL
    ▸ Cassandra
    ▸ MongoDB
    ▸ PostgreSQL
    ▸ Redis
    ▸ SQL Server
    ▸ JMX
    ▸ REST API
    ▸ Local files
    ▸ Memory
  10. CONTENTS
    ▸ ✓ Hadoop HDFS: How data is stored, accessed and distributed under the hood
    ▸ ✓ Hive: Using Hadoop as an RDBMS; importing files
    ▸ ✓ Presto: A Facebook library we can use to query the cluster
    ▸ Phresto: A library to make Presto accessible from PHP userland
    ▸ Phresto & Doctrine
  11. SIMPLE PHP CLIENT
    use Psr\Log\NullLogger; // assumption: the slide's NullLogger is the PSR-3 one

    // Protocol, hostname and port of the Presto coordinator
    $socket = new \SamKnows\Phresto\Client\RemoteHost('http', 'coordinator.hostname.com', '8080');
    $connection = new \SamKnows\Phresto\Client\HttpConnection(
        $socket, new NullLogger(), 'Michael\'s Macbook', 'Michael', '', 0, 0, 0, false
    );
    // Arguments: SQL, catalog, schema
    $result = $connection->executeQuery('SELECT * FROM table WHERE id=5', 'hive', 'database_name');
    $resultClass = $result->getResult();
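
For context, a client like this talks to the coordinator over Presto's HTTP protocol: the statement is POSTed to `/v1/statement`, and the client then follows the `nextUri` field of each JSON response until none is left, collecting rows from `data` along the way. The sketch below illustrates that loop only. It is not Phresto's actual implementation, and `runPrestoQuery`, the `$send` transport callable and the canned pages are hypothetical names chosen so the example runs without a cluster.

```php
<?php
declare(strict_types=1);

/**
 * Runs a statement against a Presto coordinator and collects all rows.
 *
 * $send is a hypothetical transport with the shape
 * fn(string $method, string $url, array $headers, ?string $body): array
 * that returns the decoded JSON response.
 */
function runPrestoQuery(
    callable $send,
    string $baseUrl,
    string $sql,
    string $user,
    string $catalog,
    string $schema
): array {
    $headers = [
        'X-Presto-User' => $user,
        'X-Presto-Catalog' => $catalog,
        'X-Presto-Schema' => $schema,
    ];

    // The SQL text is sent as the raw request body.
    $response = $send('POST', $baseUrl . '/v1/statement', $headers, $sql);
    $rows = [];

    while (true) {
        if (isset($response['error'])) {
            throw new RuntimeException($response['error']['message'] ?? 'Query failed');
        }
        // A page may or may not carry a batch of rows under "data".
        foreach ($response['data'] ?? [] as $row) {
            $rows[] = $row;
        }
        // A response without "nextUri" means the query has finished.
        if (!isset($response['nextUri'])) {
            return $rows;
        }
        $response = $send('GET', $response['nextUri'], $headers, null);
    }
}

// Fake transport with canned responses (made-up URLs and rows).
$pages = [
    'http://coordinator.hostname.com:8080/v1/statement' => ['nextUri' => 'page/1'],
    'page/1' => ['data' => [[5, 'a']], 'nextUri' => 'page/2'],
    'page/2' => ['data' => [[5, 'b']]],
];
$fake = fn(string $method, string $url, array $headers, ?string $body): array => $pages[$url];

$rows = runPrestoQuery($fake, 'http://coordinator.hostname.com:8080', 'SELECT 1', 'Michael', 'hive', 'database_name');
echo json_encode($rows), "\n";
```

Passing the transport in as a callable keeps the paging loop testable without a network; a real client would swap in an HTTP implementation behind the same shape.
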
  12. CONTENTS
    ▸ ✓ Hadoop HDFS: How data is stored, accessed and distributed under the hood
    ▸ ✓ Hive: Using Hadoop as an RDBMS; importing files
    ▸ ✓ Presto: A Facebook library we can use to query the cluster
    ▸ ✓ Phresto: A library to make Presto accessible from PHP userland
    ▸ Phresto & Doctrine
  13. DOCTRINE DBAL / PDO-STYLE INTERFACE
    use Doctrine\DBAL\Configuration;
    use Doctrine\DBAL\DriverManager;
    use SamKnows\Phresto\Doctrine\DBAL\Driver\Phresto\Driver;

    $configuration = new Configuration();
    $params = [
        'host' => 'coordinator.hostname.com',
        'port' => '8080',
        'user' => 'Michael',
        'password' => '',
        'source' => __FILE__,
        'protocol' => 'http',
        'catalog' => 'hive',
        'schema' => 'database_name',
        'driverClass' => Driver::class,
    ];
    $connection = DriverManager::getConnection($params, $configuration);
    $result = $connection->executeQuery('SELECT * FROM messages LIMIT 1');

    // fetchAll() returns every row, so iterate over its result directly
    foreach ($result->fetchAll() as $row) {
        var_dump($row);
    }
  14. SYMFONY
    # Doctrine Configuration
    doctrine:
        dbal:
            default_connection: presto
            connections:
                presto:
                    driver_class: '\SamKnows\Phresto\Doctrine\DBAL\Driver\Phresto\Driver'
                    host: 'coordinator.hostname.com'
                    port: '8080'
                    user: 'Michael'
                    password: 'password'
                    charset: UTF8
                    options:
                        source: 'Michaels MBP'
                        catalog: 'hive'
                        schema: 'table_name'
                        protocol: 'http'

    Use parameters for these configuration values
  15. REPOSITORY
    <?php

    use Doctrine\DBAL\Connection;

    class AnalyticsRepository
    {
        private $doctrine;

        public function __construct(Connection $doctrine)
        {
            $this->doctrine = $doctrine;
        }

        public function getAveragesRttByPool(): array
        {
            // The query as shown on the slide has no FROM clause; the source table must be named before it will run
            $result = $this->doctrine->query("SELECT avg(rtt), pool GROUP BY pool")
                ->fetchAll();

            return $result;
        }
    }
  16. CONTENTS
    ▸ ✓ Hadoop HDFS: How data is stored, accessed and distributed under the hood
    ▸ ✓ Hive: Using Hadoop as an RDBMS; importing files
    ▸ ✓ Presto: A Facebook library we can use to query the cluster
    ▸ ✓ Phresto: A library to make Presto accessible from PHP userland
    ▸ ✓ Phresto & Doctrine